In [ ]:
"""This area sets up the Jupyter environment.
Please do not modify anything in this cell.
"""
import os
import sys

# Add project to PYTHONPATH for future use
sys.path.insert(1, os.path.join(sys.path[0], '..'))

# Import miscellaneous modules
from IPython.core.display import display, HTML

# Set CSS styling
with open('../admin/custom.css', 'r') as f:
    style = """<style>\n{}\n</style>""".format(f.read())
    display(HTML(style))

Outline

The following notebook will go through the basics of **supervised learning**. This will work as an introduction to the Python programming language as well as the Python package Keras which we will be using to create artificial neural networks later on.

In supervised learning we assume that our data consist of input - output pairs. A learning algorithm analyses the data and produces a function, or model, we can use to infer outputs given unseen future inputs.

Below we can see a simplified illustration of the supervised learning problem.

Pairs of inputs $\mathbf{x}$ and outputs $y$ constitutes our training examples, where the inputs are sampled from a probability distribution. A pair $(\mathbf{x}, y)$ is related by an unknown target function $f$ governed by a conditional probability distribution. The ultimate goal of supervised learning is to learn a function $g$ which approximates $f$ well.

The particular approximation $g$ we pick is called a hypothesis. A learning algorithm is responsible for picking the most appropriate hypothesis from a hypothesis set. The decision between which hypothesis to pick is done by looking at the data and typically involves an error function which measures how good a hypothesis may be.

When our learning algorithm has picked a good hypothesis, we can feed it new and unseen samples to produce output estimates.

The name of the data typically differ depending on which area you are from.

The input variables are commonly known as:

- covariates
- predictors
- features 

The output variables are commonly known as:

- variates
- targets
- labels

Linear Models: Regression

For now we will focus on one of the simplest supervised learning problems: linear regression.

A linear regression model learns a real-valued function where one or more dependent output variable(s) depend linearly on one or more independent input variable(s). Geometrically, this real-valued function can be interpreted as a hyperplane which we attempt to fit to our data.

Motivation

  • Allows us to investigate the relationship between two or more variables statistically
  • Can be thought of as a building block of artificial neural networks
  • A solution can be found analytically or using data-driven optimisation
  • Basic introduction to supervised learning
  • Introduces you to the Python programming language and Jupyter notebook usage

Notation

This notebook will use the following notation:

  • A (training) dataset has $N$ input - output pairs: $(\mathbf{x}_i, y_i)$, where $i$ signifies the $i$th example
  • Each input $\mathbf{x}_i$ is a $d$ dimensional column vector: $\mathbf{x}_i \in \mathbb{R}^d$
  • For this notebook we will assume the output to be univariate: $y \in \mathbb{R}$

Keep in mind that additional notation will be introduced as we continue through the notebooks.

Example: Income vs. Education

In the following example we will load data from a CSV file and use to estimate a linear model between an Education index and a Income index.

  • input $\rightarrow$ Scalar metric indicating level of education
  • output $\rightarrow$ Scalar metric indication level of income
In the follow code snippets we will:
  • Load data from a CSV file
  • Plot the data

First, let's begin by importing a selection of Python package that will prove useful for the rest of this Jupyter notebook.


In [ ]:
# Plots will be show inside the notebook
%matplotlib notebook
import matplotlib.pyplot as plt

# NumPy is a package for manipulating N-dimensional array objects 
import numpy as np

# Pandas is a data analysis package
import pandas as pd

import problem_unittests as tests

With Pandas we can load the aforementioned CSV data.


In [ ]:
# Load data and print the first n = 5 rows
# URL: http://www-bcf.usc.edu/~gareth/ISL/Income1.csv
DATA_URL = './resources/Income1.csv'
data = pd.read_csv(DATA_URL, index_col=0)

print(data.head(n=5))

# Put the second (education index) and third (income index) row in a NumPy array
X_data = data['Education'].values
y_data = data['Income'].values

With the data loaded we can plot it as a scatter plot using matplotlib.


In [ ]:
plt.figure()

plt.scatter(X_data, y_data, label='Training data')

plt.title('Education vs. Income')
plt.xlabel('Education index')
plt.ylabel('Income index')
plt.grid(linestyle='dotted')
plt.legend()

plt.show()

Modelling

As previously mentioned, we will be using a linear model. That is, the output will be a linear combination of the input plus a bias or intercept:

$$ \begin{equation*} g(\mathbf{x}) = b + \sum_{j=1}^{d}w_jx_j \end{equation*} $$

Keep in mind that in this problem there is only a single independent variable $\mathbf{x}$, which means the above can be simplified to: $g(x) = b + wx$, where $b$ is the intercept and $w$ is the slope.

Notational Simplifications

To simplify notation, it is quite common to merge the bias $b$ with the weights $w_i$ to get a single weight vector $\mathbf{w} = (w_0, w_1, \ldots, w_d)^\intercal$, where $w_0 = b$. Consequently, an extra dimension must be prepended to the input vector, i.e. $\mathbf{x} = (1, x_1, \ldots, x_d)^\intercal$.

With this simplification the linear model can be written as:

$$ \begin{equation*} g(\mathbf{x}) = \sum_{j=1}^{d}w_jx_j \end{equation*} $$

Matrix Form

The above model takes a single input $\mathbf{x}$ and produces a single output prediction. We can take this one step further by putting all of the input examples in a single matrix called the design matrix $\mathbf{X}$. This matrix consists of one (training) example per row.


$$ \begin{equation*} \mathbf{X} = \begin{bmatrix} 1 & \mathbf{x}_{11} & \cdots & \mathbf{x}_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \mathbf{x}_{N1} & \cdots & \mathbf{x}_{Nd} \end{bmatrix} = \left[ \begin{array}{c} \mathbf{x}_{1}^\intercal \\ \vdots\\ \mathbf{x}_{N}^\intercal\end{array} \right] \end{equation*} $$

With the design matrix, predictions can be done by matrix multiplication:


$$ \begin{equation*} \hat{\mathbf{y}} = \mathbf{X}\mathbf{w} = \begin{bmatrix} 1 & \mathbf{x}_{11} & \cdots & \mathbf{x}_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \mathbf{x}_{N1} & \cdots & \mathbf{x}_{Nd} \end{bmatrix} \left[ \begin{array}{c} \mathbf{w}_{0} \\ \mathbf{w}_{1} \\ \vdots\\ \mathbf{w}_{d}\end{array} \right] = \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots\\ y_{N}\end{array} \right] \end{equation*} $$

Defining an Error Function

To measure how well our hypothesis, i.e. a particular set of weights, approximates the unknown target function $f$ we will have to come up with an error function. This quantification, which we will call $J$, goes by several different names:

  • Cost
  • Energy
  • Error
  • Loss
  • Objective

We will be using squared error: $(g(\mathbf{x}) - f(\mathbf{x}))^2$ to measure how well our hypothesis approximates $f$. Seeing as we do not have access to $f$ we will instead compute an in-sample squared error over all our training data. This measure is commonly known as mean squared error (MSE):

$$ \begin{equation*} J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}(g(\mathbf{x}_i) - y_i)^2 = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^\intercal \mathbf{x}_i - y_i)^2 = \frac{1}{N}\lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert^2 \end{equation*} $$

A simple analogy is to think of mean squared error as a set of springs, one per training example. The objective of the learning algorithm is to balance the learned hyperplane by attempting to push it as close as we can to each of the training samples. Thus, the futher the training sample is to our hyperplane, the stronger the force is on a particular spring.

Minimising the Error Function in Matrix Form

Now, to get a good approximation, we need to select weights $\mathbf{w}$ so that the error $J(\mathbf{w})$ is minimised. This is commonly called ordinary least squares or OLS. There are several ways to do this, for example, gradient descent, however, for now we will simply take the derivative of $J(\mathbf{w})$ with respect to $\mathbf{w}$ and then equate it to zero to get the closed-form solution.

First though, we need to expand the mean squared error representation so that we can differentiate it. The constant $\frac{1}{N}$ has been removed as it will not impact the selected weights.


$$ \begin{equation*} \begin{aligned} J(\mathbf{w}) &= \lVert \mathbf{X}\mathbf{w} - \mathbf{y}\rVert^2 \\ & = (\mathbf{X}\mathbf{w} - \mathbf{y})^\intercal(\mathbf{X}\mathbf{w} - \mathbf{y}) \\ & = ((\mathbf{X}\mathbf{w})^\intercal - \mathbf{y}^\intercal)(\mathbf{X} \mathbf{w} - \mathbf{y}) \\ & = (\mathbf{X}\mathbf{w})^\intercal \mathbf{X}\mathbf{w} - (\mathbf{X}\mathbf{w})^\intercal \mathbf{y} - \mathbf{y}^\intercal(\mathbf{X} \mathbf{w}) + \mathbf{y}^\intercal\mathbf{y} \\ & = \mathbf{w}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{w} - 2(\mathbf{X}\mathbf{w})^\intercal \mathbf{y} + \mathbf{y}^\intercal\mathbf{y} \\ & = \mathbf{w}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{w} - 2\mathbf{y}^\intercal\mathbf{X}\mathbf{w} + \mathbf{y}^\intercal\mathbf{y} \end{aligned} \end{equation*} $$

Before we move on, here are some useful properties for matrix differentiation:
  • $\frac{\partial \mathbf{w}^\intercal\mathbf{A}\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{A}^\intercal\mathbf{w}$
  • $\frac{\partial \mathbf{B}\mathbf{w}}{\partial \mathbf{w}} = \mathbf{B}^\intercal$

Let $A = \mathbf{X}^\intercal\mathbf{X}$ and $B = 2\mathbf{y}^\intercal\mathbf{X}$. Substitute and differentiate:


$$ \begin{equation*} \begin{aligned} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} &= \frac{\partial}{\partial \mathbf{w}} (\mathbf{w}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{w} - 2\mathbf{y}^\intercal\mathbf{X}\mathbf{w} + \mathbf{y}^\intercal\mathbf{y}) \\ &= \frac{\partial}{\partial \mathbf{w}} (\mathbf{w}^\intercal A \mathbf{w} - B\mathbf{w} + \mathbf{y}^\intercal\mathbf{y}) \\ &= 2\mathbf{A}^\intercal\mathbf{w} - \mathbf{B}^\intercal + 0 \end{aligned} \end{equation*} $$

Now, let's replace $\mathbf{A}$ and $\mathbf{B}$:


$$ \begin{equation*} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = 2\mathbf{X}^\intercal\mathbf{X}\mathbf{w} - 2\mathbf{X}^\intercal\mathbf{y} \end{equation*} $$

Finally, let's throw away constant terms, equate to zero, and solve for $\mathbf{w}$:


$$ \begin{equation*} \begin{aligned} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} &= 0 \\ \mathbf{X}^\intercal\mathbf{X}\mathbf{w} - \mathbf{X}^\intercal\mathbf{y} &= 0 \\ \mathbf{X}^\intercal\mathbf{X}\mathbf{w} &= \mathbf{X}^\intercal\mathbf{y} \\ \mathbf{w} &= (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} \end{aligned} \end{equation*} $$

And there we have it, the closed-form solution for ordinary least squares.

Notice how we have to compute the inverse of a matrix. This means that $\mathbf{X}^\intercal\mathbf{X}$ must be non-singular, however, there are ways to circumvent this issue, for example, by using the Moore-Penrose pseudoinverse instead: numpy.linalg.pinv().

Using the Closed-Form Solution

To use the closed-form solution we derived above to solve the income vs. education problem we require a few things, namely:

  • The design matrix $\mathbf{X}$
  • A column vector of ground truths $\mathbf{y}$
  • A function that takes the two aforementioned matrices and evaluates the closed-form solution to get a set of weights $\mathbf{w}$

The last two requirements will have to be implemented by you.

In the follow code snippet we will:
  • Create the design matrix $\mathbf{X}$

In [ ]:
def build_X(x_data):
    """Return design matrix given an array of N samples with d dimensions.
    """
    # Create matrix Ax1 if d = 1
    if x_data.ndim == 1:
        x_data = np.expand_dims(x_data, axis=1)

    # Find the number of samples and dimensions
    nb_samples = x_data.shape[0]
    nb_dimensions = x_data.shape[1]

    # Create Nxd+1 matrix filled with ones
    _X = np.ones((nb_samples, nb_dimensions + 1))

    # Paste in the data we have in the new matrix
    _X[:nb_samples, 1:nb_dimensions + 1] = x_data

    return _X

# Test and see that the design matrix was built correctly
tests.test_build_x(build_X)

Task I: Build y

The second component we require is the vector $\mathbf{y}$. This is a column vector over all the ground truths or target values in our training dataset. For completeness, it has the following form:


$$ \begin{equation*} \mathbf{y} = \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots\\ y_{N}\end{array} \right] \end{equation*} $$

**Task**: Build the $\mathbf{y}$ vector shown above. Use the previous code snippet as a reference for your implementation.

In [ ]:
def build_y(y_data):
    """Return a column vector containing the target values y.
    """
    # Make a copy of the argument that we can work on
    _y = y_data.copy()

    # Create y matrix Nx1


    # Return result
    return _y

### Do *not* modify the following line ###
# Test and see that the y vector was built correctly
tests.test_build_y(build_y)

Task II: Implement Closed-Form Solution

Now that we have both the design matrix $\mathbf{X}$ and the vector of target values $\mathbf{y}$ we can fit a linear model using the closed-form solution we derived before. Remember all of we have to do is implement the following expression:

$$ \begin{equation*} \mathbf{w} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} \end{equation*} $$

Please refer to the following sources for how to utilise the various functions in NumPy when implementing your solution:

**Task**: Implement a function that evaluates the closed-form solution given a design matrix $\mathbf{X}$ and target vector $\mathbf{y}$.

In [ ]:
def compute_weights(X, y):
    """Return a vector of weights found by the derived closed-form solution.
    """
    weights = None

    # Implement closed-form solution here

    
    return weights

### Do *not* modify the following line ###
# Test and see that the weights are calculated correctly
tests.test_compute_theta(compute_weights)

Task III: Learn a Linear Regression Model

We have now implemeted all of the necessary building blocks:

  • build_X() : Used to build the design matrix $\mathbf{X}$
  • build_y() : Used to build the vector of target values $\mathbf{y}$
  • compute_weights : Used to fit a linear model to the data using the solution we derived above

After we have estimated $\mathbf{w}$ we can perform predictions on unseen data by computing: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$.

**Task**: Learn the weights $\mathbf{w}$ given the building blocks we have implemented.

In [ ]:
# Build design matrix (TASK)
X = None

# Build y vector (TASK)
y = None

# Learn linear model (TASK)
W = None
In the follow code snippet we will:
  • Print the weights we learned
  • Plot the hyperplane (line in our case because $d=1$) that $\mathbf{w}$ represents

In [ ]:
# Print weights
print('The learned linear model looks like this:')
print('Y = {:.3f} x + {:.3f}'.format(W[1, 0], W[0, 0]))

# Plot hyperplane and training data
xs = np.linspace(X_data.min(), X_data.max(), num=50)
ys = np.dot(build_X(xs), W)

plt.figure()

plt.scatter(X_data, y_data, label='Training data')
plt.plot(xs, ys, color='Red', linewidth=1, label='Fit')

plt.title('Education vs. Income')
plt.xlabel('Education index')
plt.ylabel('Income index')
plt.grid(linestyle='dotted')
plt.legend()

plt.show()

Critical Analysis

Albeit easy to derive and easy to use, our closed-form solution has a few shortcomings:

  • Requires matrix inversion
    • Very computationally expensive
    • Not ideal for distributed computing
  • Issues become apparant when the number of features $d$ and number of samples $N$ begin to grow
    • Depending on the size of the dataset it might be difficult / infeasible to fit all of it in memory

To tackle these issues we will attempt to solve the linear regression problem using an iterative optimisation method called gradient descent.

Gradient Descent

In artificial neural network literature one can see several different symbols in use to signify the error function. For example, in addition to $J$ there is also $E$ (error), $L$ (loss), $C$ (cost), and even $err$. The rest of this notebook will use the symbol $E$ instead of $J$.

Gradient descent is an iterative optimisation algorithm. In general, it works by taking the derivative of an error function $E(\mathbf{w})$ with respect to the parameters $\mathbf{w}$ of the model, and then alter the parameters in the direction of the negative gradient.

This can be summarised as: $\mathbf{w}(k+1)\leftarrow\mathbf{w}(k) - \eta\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}$, where $\mathbf{w}(k)$ signifies the state of the model parameters at iteration $k$, and $\eta$ is known as the learning rate and decides how much the parameters should change with each application of the rule.

This update rule is repeated until convergence or until the maximum number of iterations has been reached.

With gradient descent we can:

  • Reduce memory issues by only working on parts of the data at a time
  • Distribute the computation among several computational nodes. This enables distributed computing and parallelisation which allows us to exploit new architectures such as GPUs, FPGAs, and ASICs
  • Gradient descent is a heavily use type of algorithm that opens the door for models such as artificial neural networks

Digression: A Different Perspective

Linear models, such as linear regression, can be represented as artifical neural networks. An illustration of this can be seen below:

As before, the input $\mathbf{x} \in \mathbb{R}^d$ and the input is integrated via a linear combination plus a bias. The integrated value is activated by an activation function $\sigma$, which for our linear regression model is defined as $\sigma(x) = x$.

In other words, $\hat{y}$ is defined as $\sigma(\mathbf{X}\mathbf{w})$, which simplifies to $\mathbf{X}\mathbf{w}$ because the activation function used for linear regression is the identity function. In artificial neural network terminology we would typically say that the activation function is linear.

Learning with Gradient Descent

As we saw above, learning with gradient descent is easy. All we have to do is apply an update rule a set number of iterations until we are satisfied with the resulting weights. The update rule can be be seen below:

$$ \begin{equation*} \mathbf{w}(k+1)\leftarrow\mathbf{w}(k) - \eta\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}} \end{equation*} $$

In words, the weights for the next iteration $k+1$ is the weights of the current iteration $k$ plus the negative gradient $\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}$ scaled by the learning rate $\eta$. In other words, for each iteration in gradient descent we adjust the weights we have with respect to the gradient of the error function $E(\mathbf{w})$.

An illustration of how this could look like with the mean squared error function can be seen below:

The current state of several different weight states are signified by red dots, while the arrow points in the negative gradient direction. The optimal weight state is found at the minima, which yields the lowest amount of error.

Finding the Gradient

To finalise the update rule we need to find: $\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}$. This, of course, depends on the form of $E(\mathbf{w})$.

The squared error for a single sample $\mathbf{x}_i$ in the training dataset is defined as:

$E(\mathbf{w}) = (\hat{y}_i - y_i)^2$

where $\hat{y}_i=\sigma(g)$ and $g(\mathbf{x})=\mathbf{w}^\intercal \mathbf{x}_i$.

To simplify the derivation we will scale the squared error by halving it; this will not change the optimal solution:

$E(\mathbf{w}) = \frac{1}{2}(\hat{y}_i - y_i)^2$

Let's now attempt to find the derivative we need:


$$ \begin{equation*} \frac{\partial E(\mathbf{w})}{\partial\mathbf{w}} = \frac{\partial}{\partial\mathbf{w}}( \frac{1}{2}(\hat{y}_i - y_i)^2) \end{equation*} $$

Seeing as $\hat{y}$ is dependent on $\mathbf{w}$ we will need to use the chain rule of calculus.

Let $a(b) = \frac{1}{2}(b)^2$ and $b(\mathbf{w}) = (\hat{y}_i - y_i)$, then $\frac{\partial a}{\partial\mathbf{w}}=\frac{\partial a}{\partial b}\frac{\partial b}{\partial\mathbf{w}}$.

Therefore:


$$ \begin{equation*} \begin{aligned} \frac{\partial E(\mathbf{w})}{\partial\mathbf{w}} &= \frac{\partial a}{\partial b}\frac{\partial b}{\partial\mathbf{w}} \\ &= (\hat{y}_i - y_i)\frac{\partial}{\partial\mathbf{w}}(\hat{y}_i - y_i) \\ &= (\hat{y}_i - y_i)((\frac{\partial}{\partial\mathbf{w}}\hat{y}_i) - (\frac{\partial}{\partial\mathbf{w}}y_i)) \\ &= (\hat{y}_i - y_i)((\frac{\partial}{\partial\mathbf{w}}\hat{y}_i) - 0) \\ &= (\hat{y}_i - y_i)\frac{\partial}{\partial\mathbf{w}}\hat{y}_i \\ \end{aligned} \end{equation*} $$

Keep in mind that:

  • $\hat{y}_i=\sigma(g)$
  • $g(\mathbf{x})=\mathbf{w}^\intercal \mathbf{x}_i$.

For now, let's replace $\hat{y}$ with $\sigma(g)$:


$$ \begin{equation*} \begin{aligned} \frac{\partial E(\mathbf{w})}{\partial\mathbf{w}} &= (\hat{y}_i - y_i)\frac{\partial}{\partial\mathbf{w}}\sigma(g) \end{aligned} \end{equation*} $$

Again we have to use the chain rule.

Let $a(b) = \sigma(b)$ and $b(\mathbf{w}) = (\mathbf{w}^\intercal \mathbf{x}_i)$, then $\frac{\partial a}{\partial\mathbf{w}}=\frac{\partial a}{\partial b}\frac{\partial b}{\partial\mathbf{w}}$.


$$ \begin{equation*} \begin{aligned} \frac{\partial E(\mathbf{w})}{\partial\mathbf{w}} &= (\hat{y}_i - y_i)\frac{\partial a}{\partial b}\frac{\partial b}{\partial\mathbf{w}} \\ &= (\hat{y}_i - y_i)\sigma '(g)\mathbf{x}_i \end{aligned} \end{equation*} $$

Thus, the update rule for gradient descent, regardless of activation function, is defined as:


$$ \begin{equation*} \mathbf{w}(k+1) \leftarrow \mathbf{w}(k) - \eta((\hat{y}_i - y_i)\sigma '(g)\mathbf{x}_i) \end{equation*} $$

Seeing as we're doing linear regression, we know that activation function is linear, i.e. $\sigma(x)=x$, where $\sigma'(x)=1$. So the final update rule will look like this:

$$ \begin{equation*} \begin{aligned} \mathbf{w}(k+1) &\leftarrow \mathbf{w}(k) - \eta((\hat{y}_i - y_i)\mathbf{x}_i) \\ &\leftarrow \mathbf{w}(k) - \eta((\mathbf{w}^\intercal \mathbf{x}_i - y_i)\mathbf{x}_i) \end{aligned} \end{equation*} $$

Note that this updates the weights using only a single input example. This is generally called stochastic gradient descent. Typically the amount we adjust by is taken over a batch, i.e. subset, of examples.

For completeness, the update rule above can be defined over a set of $m$ samples like so:

$$ \begin{equation*} \mathbf{w}(k+1) \leftarrow \mathbf{w}(k) - \eta\frac{1}{m}\sum_{i=i}^{m}(\mathbf{w}^\intercal \mathbf{x}_i - y_i)\mathbf{x}_i \end{equation*} $$

Gradient Descent with Keras

Thankfully, when using gradient descent we do not need to derive and implement it ourselves as there are many programming libraries out there that can do automatic differentiation for us.

In this and future notebooks we will be using the Python library Keras. This is a high-level library for building and training artificial neural networks running on either TensorFlow or Theano. We will be able to leverage Keras when creating our linear regression model with gradient descent because linear models can be interpreted as artificial neural networks.

In the following code snippets we will:
  • Create a linear regression model for the `Income` vs. `Education` problem in Keras
  • Train the model using (stochastic) gradient descent

Let's start by importing the modules we need from Keras as well as some additional ones we will use during training.


In [ ]:
import time

# A library for easily displaying progress meters
import tqdm

# Contains all built-in optimisation tools in Keras, such as stochastic gradient descent
from keras import optimizers

# An input "layer" and a densely-connected neural network layer
from keras.layers import Input, Dense

# Model is an API that wraps our linear regression model
from keras.models import Model

The input to our model is a single scalar value (Education). The output is also a single scalar value (Income).


In [ ]:
# There is only a *single* feature
input_X = Input(shape=(1,))

# The output of the model is a single value
output_y  = Dense(units=1, use_bias=True)(input_X)

# We give the input and output to our Model API
model = Model(inputs=input_X, outputs=output_y)

# Print a summary of the model
model.summary()

Notice in the print above how the fully-connected layer Dense() has two trainable parameters. One is the weight (slope), while the second is the bias (intercept). Keras adds bias units by default, but it can be turned off by setting use_bias=False.

The next thing we have to do in Keras is to set up an optimiser (sometimes called solver). There are many alternatives to select from, however, we will settle for the stochastic gradient descent algorithm we discussed earlier.


In [ ]:
#
# Start by setting some user options
#

# Learning rate (set very small so we can clearly see the training progress)
lr = 0.0001

# Number of times to apply the update rule
nb_iterations = 100

# Number of samples to include each iteration (used to compute gradients)
nb_samples = 30

# Create optimiser using Keras
sgd = optimizers.SGD(lr=lr)

# Add the optimiser to our model, make it optimise mean squared error
model.compile(optimizer=sgd, loss='mean_squared_error')

Now that both the model definition and the optimiser is set up we can start training. Training using the Keras model application programming interface (API) is done by calling the fit() method.

Don't worry too much if this code is a bit too much right now. We will get much more experience with using Keras throughout the upcoming notebooks.

While training the model, a plot is continuously updated to display the fitted line.


In [ ]:
fig, ax = plt.subplots(1,1)

# Perform `nb_iterations` update rule applications
for i in tqdm.tqdm(np.arange(nb_iterations)):
    # Learn by calling the `fit` method
    model.fit(X_data, y_data, 
              batch_size=nb_samples, 
              epochs=1, 
              verbose=0)

    # Make a plot of the data and the current fit
    xs = np.linspace(X_data.min(), X_data.max(), num=50)
    ys = model.predict(xs)

    ax.clear() 

    ax.scatter(X_data, y_data, label='Training data')
    ax.plot(xs, ys, color='Red', linewidth=1, label='Fit')

    ax.set_xlabel('Education index')
    ax.set_ylabel('Income index')
    ax.grid(linestyle='dotted')
    ax.legend()

    fig.canvas.draw()

    time.sleep(0.05)

In [ ]: