In [ ]:
"""This area sets up the Jupyter environment.
Please do not modify anything in this cell.
"""
import os
import sys
# Add project to PYTHONPATH for future use
sys.path.insert(1, os.path.join(sys.path[0], '..'))
# Import miscellaneous modules
from IPython.core.display import display, HTML
# Set CSS styling
with open('../admin/custom.css', 'r') as f:
    style = """<style>\n{}\n</style>""".format(f.read())
display(HTML(style))
In supervised learning we assume that our data consist of input-output pairs. A learning algorithm analyses the data and produces a function, or model, we can use to infer outputs given unseen future inputs.
Below we can see a simplified illustration of the supervised learning problem.
Pairs of inputs $\mathbf{x}$ and outputs $y$ constitute our training examples, where the inputs are sampled from a probability distribution. A pair $(\mathbf{x}, y)$ is related by an unknown target function $f$ governed by a conditional probability distribution. The ultimate goal of supervised learning is to learn a function $g$ which approximates $f$ well.
The particular approximation $g$ we pick is called a hypothesis. A learning algorithm is responsible for picking the most appropriate hypothesis from a hypothesis set. The decision of which hypothesis to pick is made by looking at the data and typically involves an error function which measures how good a hypothesis may be.
When our learning algorithm has picked a good hypothesis, we can feed it new and unseen samples to produce output estimates.
The names used for these variables typically differ depending on which field you come from.
The input variables are commonly known as:
- covariates
- predictors
- features
The output variables are commonly known as:
- variates
- targets
- labels
For now we will focus on one of the simplest supervised learning problems: linear regression.
A linear regression model learns a real-valued function where one or more dependent output variable(s) depend linearly on one or more independent input variable(s). Geometrically, this real-valued function can be interpreted as a hyperplane which we attempt to fit to our data.
This notebook will use the following notation:
Keep in mind that additional notation will be introduced as we continue through the notebooks.
In the following example we will load data from a CSV file and use it to estimate a linear model between an Education index and an Income index.
First, let's begin by importing a selection of Python packages that will prove useful for the rest of this Jupyter notebook.
In [ ]:
# Plots will be shown inside the notebook
%matplotlib notebook
import matplotlib.pyplot as plt
# NumPy is a package for manipulating N-dimensional array objects
import numpy as np
# Pandas is a data analysis package
import pandas as pd
import problem_unittests as tests
With Pandas we can load the aforementioned CSV data.
In [ ]:
# Load data and print the first n = 5 rows
# URL: http://www-bcf.usc.edu/~gareth/ISL/Income1.csv
DATA_URL = './resources/Income1.csv'
data = pd.read_csv(DATA_URL, index_col=0)
print(data.head(n=5))
# Put the second (education index) and third (income index) columns into NumPy arrays
X_data = data['Education'].values
y_data = data['Income'].values
With the data loaded we can plot it as a scatter plot using matplotlib.
In [ ]:
plt.figure()
plt.scatter(X_data, y_data, label='Training data')
plt.title('Education vs. Income')
plt.xlabel('Education index')
plt.ylabel('Income index')
plt.grid(linestyle='dotted')
plt.legend()
plt.show()
As previously mentioned, we will be using a linear model. That is, the output will be a linear combination of the input plus a bias or intercept:
$$ \begin{equation*} g(\mathbf{x}) = b + \sum_{j=1}^{d}w_jx_j \end{equation*} $$
Keep in mind that in this problem there is only a single independent variable $\mathbf{x}$, which means the above can be simplified to: $g(x) = b + wx$, where $b$ is the intercept and $w$ is the slope.
To simplify notation, it is quite common to merge the bias $b$ with the weights $w_i$ to get a single weight vector $\mathbf{w} = (w_0, w_1, \ldots, w_d)^\intercal$, where $w_0 = b$. Consequently, an extra dimension must be prepended to the input vector, i.e. $\mathbf{x} = (1, x_1, \ldots, x_d)^\intercal$.
With this simplification the linear model can be written as:
$$ \begin{equation*} g(\mathbf{x}) = \sum_{j=0}^{d}w_jx_j = \mathbf{w}^\intercal\mathbf{x} \end{equation*} $$
The above model takes a single input $\mathbf{x}$ and produces a single output prediction. We can take this one step further by putting all of the input examples in a single matrix called the design matrix $\mathbf{X}$. This matrix consists of one (training) example per row.
$$
\begin{equation*}
\mathbf{X} =
\begin{bmatrix}
1 & \mathbf{x}_{11} & \cdots & \mathbf{x}_{1d} \\
\vdots & \vdots & \ddots & \vdots \\
1 & \mathbf{x}_{N1} & \cdots & \mathbf{x}_{Nd}
\end{bmatrix} =
\left[ \begin{array}{c} \mathbf{x}_{1}^\intercal \\ \vdots\\ \mathbf{x}_{N}^\intercal\end{array} \right]
\end{equation*}
$$
With the design matrix, predictions can be done by matrix multiplication:
$$
\begin{equation*}
\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} =
\begin{bmatrix}
1 & \mathbf{x}_{11} & \cdots & \mathbf{x}_{1d} \\
\vdots & \vdots & \ddots & \vdots \\
1 & \mathbf{x}_{N1} & \cdots & \mathbf{x}_{Nd}
\end{bmatrix}
\left[ \begin{array}{c} w_{0} \\ w_{1} \\ \vdots\\ w_{d}\end{array} \right] =
\left[ \begin{array}{c} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots\\ \hat{y}_{N}\end{array} \right]
\end{equation*}
$$
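To make the matrix form concrete, here is a minimal NumPy sketch; the numbers are made up purely for illustration, and np.dot is used for the matrix product as elsewhere in this notebook:

```python
import numpy as np

# Toy data: three examples with a single feature each
x = np.array([1.0, 2.0, 3.0])

# Design matrix: prepend a column of ones for the bias term
X_toy = np.column_stack([np.ones_like(x), x])   # shape (3, 2)

# An arbitrary weight vector: w_0 is the intercept, w_1 the slope
w_toy = np.array([[0.5],
                  [2.0]])                       # shape (2, 1)

# Predictions for all examples at once
y_hat = np.dot(X_toy, w_toy)                    # shape (3, 1)
print(y_hat.ravel())                            # [2.5  4.5  6.5]
```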
To measure how well our hypothesis, i.e. a particular set of weights, approximates the unknown target function $f$ we will have to come up with an error function. This quantification, which we will call $J$, goes by several different names, such as error function, cost function, loss function, or objective function.
We will be using squared error: $(g(\mathbf{x}) - f(\mathbf{x}))^2$ to measure how well our hypothesis approximates $f$. Seeing as we do not have access to $f$ we will instead compute an in-sample squared error over all our training data. This measure is commonly known as mean squared error (MSE):
$$ \begin{equation*} J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}(g(\mathbf{x}_i) - y_i)^2 = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^\intercal \mathbf{x}_i - y_i)^2 = \frac{1}{N}\lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert^2 \end{equation*} $$
A simple analogy is to think of mean squared error as a set of springs, one per training example. The objective of the learning algorithm is to balance the learned hyperplane by pushing it as close as it can to each of the training samples. Thus, the further a training sample is from our hyperplane, the stronger the force on its particular spring.
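As a small illustration, the vectorised form of the mean squared error above can be computed in a couple of lines of NumPy; the function below assumes the shapes used in this notebook, i.e. $\mathbf{X}$ is $N \times (d+1)$, $\mathbf{w}$ is $(d+1) \times 1$ and $\mathbf{y}$ is $N \times 1$:

```python
import numpy as np

def mean_squared_error(X, w, y):
    """Return (1/N) * ||Xw - y||^2 for a linear model."""
    residuals = np.dot(X, w) - y   # shape (N, 1)
    return np.mean(residuals ** 2)
```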
Now, to get a good approximation, we need to select weights $\mathbf{w}$ so that the error $J(\mathbf{w})$ is minimised. This is commonly called ordinary least squares or OLS. There are several ways to do this, for example, gradient descent, however, for now we will simply take the derivative of $J(\mathbf{w})$ with respect to $\mathbf{w}$ and then equate it to zero to get the closed-form solution.
First though, we need to expand the mean squared error expression so that we can differentiate it. The constant $\frac{1}{N}$ has been dropped, as it will not affect the selected weights.
$$
\begin{equation*}
\begin{aligned}
J(\mathbf{w}) &= \lVert \mathbf{X}\mathbf{w} -
\mathbf{y}\rVert^2 \\
& = (\mathbf{X}\mathbf{w} - \mathbf{y})^\intercal(\mathbf{X}\mathbf{w} -
\mathbf{y}) \\
& = ((\mathbf{X}\mathbf{w})^\intercal - \mathbf{y}^\intercal)(\mathbf{X}
\mathbf{w} - \mathbf{y}) \\
& = (\mathbf{X}\mathbf{w})^\intercal \mathbf{X}\mathbf{w} -
(\mathbf{X}\mathbf{w})^\intercal \mathbf{y} - \mathbf{y}^\intercal(\mathbf{X}
\mathbf{w}) + \mathbf{y}^\intercal\mathbf{y} \\
& = \mathbf{w}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{w} -
2(\mathbf{X}\mathbf{w})^\intercal \mathbf{y} + \mathbf{y}^\intercal\mathbf{y} \\
& = \mathbf{w}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{w} -
2\mathbf{y}^\intercal\mathbf{X}\mathbf{w} + \mathbf{y}^\intercal\mathbf{y}
\end{aligned}
\end{equation*}
$$
Let $A = \mathbf{X}^\intercal\mathbf{X}$ and $B = 2\mathbf{y}^\intercal\mathbf{X}$. Substitute and differentiate:
$$
\begin{equation*}
\begin{aligned}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}
&= \frac{\partial}{\partial \mathbf{w}}
(\mathbf{w}^\intercal\mathbf{X}^\intercal\mathbf{X}\mathbf{w} -
2\mathbf{y}^\intercal\mathbf{X}\mathbf{w} +
\mathbf{y}^\intercal\mathbf{y}) \\
&= \frac{\partial}{\partial \mathbf{w}}
(\mathbf{w}^\intercal A \mathbf{w} -
B\mathbf{w} +
\mathbf{y}^\intercal\mathbf{y}) \\
&= 2\mathbf{A}^\intercal\mathbf{w} - \mathbf{B}^\intercal + 0
\end{aligned}
\end{equation*}
$$
Now, let's replace $\mathbf{A}$ and $\mathbf{B}$:
$$
\begin{equation*}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}
= 2\mathbf{X}^\intercal\mathbf{X}\mathbf{w} - 2\mathbf{X}^\intercal\mathbf{y}
\end{equation*}
$$
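If you want to convince yourself that the gradient above is correct, a quick numerical sanity check against central finite differences can be done with randomly generated data; this sketch is only for verification and is not part of the exercises:

```python
import numpy as np

rng = np.random.RandomState(0)
X_check = rng.randn(10, 3)   # 10 samples, 3 columns
y_check = rng.randn(10, 1)
w_check = rng.randn(3, 1)

def J(w):
    """Squared error ||Xw - y||^2 (the 1/N constant is dropped, as above)."""
    r = np.dot(X_check, w) - y_check
    return float(np.dot(r.T, r))

# Analytical gradient: 2 X^T X w - 2 X^T y
grad_analytic = 2 * np.dot(X_check.T, np.dot(X_check, w_check)) - 2 * np.dot(X_check.T, y_check)

# Numerical gradient via central differences
eps = 1e-6
grad_numeric = np.zeros_like(w_check)
for j in range(w_check.shape[0]):
    w_plus, w_minus = w_check.copy(), w_check.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    grad_numeric[j] = (J(w_plus) - J(w_minus)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```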
Finally, let's divide through by the constant 2, set the derivative equal to zero, and solve for $\mathbf{w}$:
$$
\begin{equation*}
\begin{aligned}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}
&= 0 \\
\mathbf{X}^\intercal\mathbf{X}\mathbf{w} - \mathbf{X}^\intercal\mathbf{y} &= 0 \\
\mathbf{X}^\intercal\mathbf{X}\mathbf{w} &= \mathbf{X}^\intercal\mathbf{y} \\
\mathbf{w} &= (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}
\end{aligned}
\end{equation*}
$$
And there we have it, the closed-form solution for ordinary least squares.
Notice how we have to compute the inverse of a matrix. This means that $\mathbf{X}^\intercal\mathbf{X}$ must be non-singular. There are, however, ways to circumvent this issue, for example by using the Moore-Penrose pseudoinverse instead: numpy.linalg.pinv().
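For reference, both variants can be written in a couple of lines of NumPy. The sketch below uses a tiny made-up design matrix; your own implementation further down may of course look different:

```python
import numpy as np

# Toy design matrix (bias column + one feature) and targets following y = 2x
X_demo = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
y_demo = np.array([[2.0], [4.0], [6.0]])

# Closed-form solution via the normal equations (requires X^T X to be non-singular)
w_closed = np.dot(np.linalg.inv(np.dot(X_demo.T, X_demo)), np.dot(X_demo.T, y_demo))

# The same solution via the Moore-Penrose pseudoinverse
w_pinv = np.dot(np.linalg.pinv(X_demo), y_demo)

print(w_closed.ravel(), w_pinv.ravel())   # Both approximately [0. 2.]
```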
To use the closed-form solution we derived above to solve the Income vs. Education problem we require a few things, namely:
- the design matrix $\mathbf{X}$
- the vector of target values $\mathbf{y}$
- a function that computes the weights $\mathbf{w}$ using the closed-form solution
The last two requirements will have to be implemented by you.
In [ ]:
def build_X(x_data):
    """Return the design matrix given an array of N samples with d dimensions.
    """
    # Reshape into an Nx1 matrix if d = 1
    if x_data.ndim == 1:
        x_data = np.expand_dims(x_data, axis=1)
    # Find the number of samples and dimensions
    nb_samples = x_data.shape[0]
    nb_dimensions = x_data.shape[1]
    # Create an Nx(d+1) matrix filled with ones
    _X = np.ones((nb_samples, nb_dimensions + 1))
    # Paste the data into the new matrix, after the leading column of ones
    _X[:nb_samples, 1:nb_dimensions + 1] = x_data
    return _X
# Test and see that the design matrix was built correctly
tests.test_build_x(build_X)
The second component we require is the vector $\mathbf{y}$. This is a column vector over all the ground truths or target values in our training dataset. For completeness, it has the following form:
$$
\begin{equation*}
\mathbf{y} = \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots\\ y_{N}\end{array} \right]
\end{equation*}
$$
In [ ]:
def build_y(y_data):
    """Return a column vector containing the target values y.
    """
    # Make a copy of the argument that we can work on
    _y = y_data.copy()
    # Create the y matrix (Nx1)
    # Return result
    return _y
### Do *not* modify the following line ###
# Test and see that the y vector was built correctly
tests.test_build_y(build_y)
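If you are unsure how to shape the column vector, the standalone sketch below (using a made-up array, not the notebook data) shows one possible way to turn a 1-D NumPy array into an $N \times 1$ column vector:

```python
import numpy as np

values = np.array([3.0, 1.0, 4.0])    # a 1-D array of N target values
column = values.reshape(-1, 1)        # shape (N, 1)
# Alternatively: np.expand_dims(values, axis=1), as in build_X() above
print(column.shape)                   # (3, 1)
```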
Now that we have both the design matrix $\mathbf{X}$ and the vector of target values $\mathbf{y}$ we can fit a linear model using the closed-form solution we derived before. Remember, all we have to do is implement the following expression:
$$ \begin{equation*} \mathbf{w} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} \end{equation*} $$
Please refer to the following sources for how to utilise the various functions in NumPy when implementing your solution:
In [ ]:
def compute_weights(X, y):
    """Return a vector of weights found by the derived closed-form solution.
    """
    weights = None
    # Implement the closed-form solution here
    return weights
### Do *not* modify the following line ###
# Test and see that the weights are calculated correctly
tests.test_compute_theta(compute_weights)
We have now implemented all of the necessary building blocks:
- build_X(): used to build the design matrix $\mathbf{X}$
- build_y(): used to build the vector of target values $\mathbf{y}$
- compute_weights(): used to fit a linear model to the data using the solution we derived above
After we have estimated $\mathbf{w}$ we can perform predictions on unseen data by computing: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$.
In [ ]:
# Build design matrix (TASK)
X = None
# Build y vector (TASK)
y = None
# Learn linear model (TASK)
W = None
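If you get stuck, one possible way to complete the cell above is simply to chain the three functions we built; treat this as a hint rather than the definitive solution:

```python
# Possible completion of the tasks above (for reference only)
X = build_X(X_data)         # Design matrix with a leading column of ones
y = build_y(y_data)         # Column vector of target values
W = compute_weights(X, y)   # Closed-form ordinary least squares weights
```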
In [ ]:
# Print weights
print('The learned linear model looks like this:')
print('Y = {:.3f} x + {:.3f}'.format(W[1, 0], W[0, 0]))
# Plot hyperplane and training data
xs = np.linspace(X_data.min(), X_data.max(), num=50)
ys = np.dot(build_X(xs), W)
plt.figure()
plt.scatter(X_data, y_data, label='Training data')
plt.plot(xs, ys, color='Red', linewidth=1, label='Fit')
plt.title('Education vs. Income')
plt.xlabel('Education index')
plt.ylabel('Income index')
plt.grid(linestyle='dotted')
plt.legend()
plt.show()
Albeit easy to derive and easy to use, our closed-form solution has a few shortcomings:
- computing the inverse of $\mathbf{X}^\intercal\mathbf{X}$ becomes expensive as the number of dimensions grows
- $\mathbf{X}^\intercal\mathbf{X}$ must be non-singular
- the entire dataset has to be available at once
To tackle these issues we will attempt to solve the linear regression problem using an iterative optimisation method called gradient descent.
Gradient descent is an iterative optimisation algorithm. In general, it works by taking the derivative of an error function $E(\mathbf{w})$ with respect to the parameters $\mathbf{w}$ of the model, and then altering the parameters in the direction of the negative gradient.
This can be summarised as: $\mathbf{w}(k+1)\leftarrow\mathbf{w}(k) - \eta\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}$, where $\mathbf{w}(k)$ signifies the state of the model parameters at iteration $k$, and $\eta$ is known as the learning rate and decides how much the parameters should change with each application of the rule.
This update rule is repeated until convergence or until the maximum number of iterations has been reached.
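As a minimal illustration of the rule itself, independent of linear regression, the sketch below applies gradient descent to the simple one-dimensional error function $E(w) = (w - 3)^2$, whose gradient is $2(w - 3)$; the starting point and learning rate are arbitrary choices:

```python
# Minimal gradient descent on E(w) = (w - 3)^2, with dE/dw = 2 * (w - 3)
w = 0.0       # initial parameter value
eta = 0.1     # learning rate
for k in range(50):
    gradient = 2.0 * (w - 3.0)
    w = w - eta * gradient
print(w)      # Close to the minimum at w = 3
```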
With gradient descent we can:
- avoid computing an expensive matrix inverse
- process the training data in smaller batches
- optimise models for which no closed-form solution exists
Linear models, such as linear regression, can be represented as artificial neural networks. An illustration of this can be seen below:
As before, the input is $\mathbf{x} \in \mathbb{R}^d$, and it is integrated via a linear combination plus a bias. The integrated value is then passed through an activation function $\sigma$, which for our linear regression model is defined as $\sigma(x) = x$.
In other words, $\hat{y}$ is defined as $\sigma(\mathbf{X}\mathbf{w})$, which simplifies to $\mathbf{X}\mathbf{w}$ because the activation function used for linear regression is the identity function. In artificial neural network terminology we would typically say that the activation function is linear.
As we saw above, learning with gradient descent is easy. All we have to do is apply an update rule for a set number of iterations until we are satisfied with the resulting weights. The update rule can be seen below:
$$ \begin{equation*} \mathbf{w}(k+1)\leftarrow\mathbf{w}(k) - \eta\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}} \end{equation*} $$
In words, the weights for the next iteration $k+1$ are the weights of the current iteration $k$ plus the negative gradient $\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}$ scaled by the learning rate $\eta$. That is, in each iteration of gradient descent we adjust the current weights with respect to the gradient of the error function $E(\mathbf{w})$.
An illustration of how this could look with the mean squared error function can be seen below:
Several different weight states are signified by red dots, while the arrows point in the negative gradient direction. The optimal weight state is found at the minimum, which yields the lowest error.
To finalise the update rule we need to find: $\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}$. This, of course, depends on the form of $E(\mathbf{w})$.
The squared error for a single sample $\mathbf{x}_i$ in the training dataset is defined as:
$E(\mathbf{w}) = (\hat{y}_i - y_i)^2$
where $\hat{y}_i=\sigma(g)$ and $g=\mathbf{w}^\intercal \mathbf{x}_i$.
To simplify the derivation we will scale the squared error by halving it; this will not change the optimal solution:
$E(\mathbf{w}) = \frac{1}{2}(\hat{y}_i - y_i)^2$
Let's now attempt to find the derivative we need:
$$
\begin{equation*}
\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}
= \frac{\partial}{\partial\mathbf{w}}( \frac{1}{2}(\hat{y}_i - y_i)^2)
\end{equation*}
$$
Seeing as $\hat{y}_i$ depends on $\mathbf{w}$ we will need to use the chain rule of calculus.
Therefore, with $a = \frac{1}{2}b^2$ and $b = \hat{y}_i - y_i$:
$$
\begin{equation*}
\begin{aligned}
\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}
&= \frac{\partial a}{\partial b}\frac{\partial b}{\partial\mathbf{w}} \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial\mathbf{w}}(\hat{y}_i - y_i) \\
&= (\hat{y}_i - y_i)((\frac{\partial}{\partial\mathbf{w}}\hat{y}_i) - (\frac{\partial}{\partial\mathbf{w}}y_i)) \\
&= (\hat{y}_i - y_i)((\frac{\partial}{\partial\mathbf{w}}\hat{y}_i) - 0) \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial\mathbf{w}}\hat{y}_i \\
\end{aligned}
\end{equation*}
$$
Keep in mind that:
- $\hat{y}_i = \sigma(g)$, where $g = \mathbf{w}^\intercal\mathbf{x}_i$
- the target $y_i$ does not depend on $\mathbf{w}$, so its derivative is zero
For now, let's replace $\hat{y}$ with $\sigma(g)$:
$$
\begin{equation*}
\begin{aligned}
\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}
&= (\hat{y}_i - y_i)\frac{\partial}{\partial\mathbf{w}}\sigma(g)
\end{aligned}
\end{equation*}
$$
Again we have to use the chain rule, this time with $a = \sigma(g)$ and $b = g = \mathbf{w}^\intercal\mathbf{x}_i$:
$$
\begin{equation*}
\begin{aligned}
\frac{\partial E(\mathbf{w})}{\partial\mathbf{w}}
&= (\hat{y}_i - y_i)\frac{\partial a}{\partial b}\frac{\partial b}{\partial\mathbf{w}} \\
&= (\hat{y}_i - y_i)\sigma '(g)\mathbf{x}_i
\end{aligned}
\end{equation*}
$$
Thus, the update rule for gradient descent, regardless of activation function, is defined as:
$$
\begin{equation*}
\mathbf{w}(k+1) \leftarrow \mathbf{w}(k) - \eta((\hat{y}_i - y_i)\sigma '(g)\mathbf{x}_i)
\end{equation*}
$$
Seeing as we're doing linear regression, we know that the activation function is linear, i.e. $\sigma(x)=x$, and therefore $\sigma'(x)=1$. So the final update rule will look like this:
$$ \begin{equation*} \mathbf{w}(k+1) \leftarrow \mathbf{w}(k) - \eta(\mathbf{w}^\intercal \mathbf{x}_i - y_i)\mathbf{x}_i \end{equation*} $$
Note that this updates the weights using only a single input example at a time. This is generally called stochastic gradient descent. Typically the amount we adjust by is computed over a batch, i.e. a subset, of examples.
For completeness, the update rule above can be defined over a set of $m$ samples like so:
$$ \begin{equation*} \mathbf{w}(k+1) \leftarrow \mathbf{w}(k) - \eta\frac{1}{m}\sum_{i=1}^{m}(\mathbf{w}^\intercal \mathbf{x}_i - y_i)\mathbf{x}_i \end{equation*} $$
Thankfully, when using gradient descent we do not need to derive and implement the gradients ourselves, as there are many programming libraries out there that can do automatic differentiation for us.
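Nevertheless, to make the update rule concrete before we hand things over to a library, here is a small NumPy sketch of the mini-batch rule above applied to a toy linear regression problem; the data and hyperparameters are made up purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy data: y = 1 + 2x plus a little noise, with a bias column in the design matrix
x_toy = rng.uniform(0, 10, size=(100, 1))
X_toy = np.hstack([np.ones((100, 1)), x_toy])
y_toy = 1.0 + 2.0 * x_toy + 0.1 * rng.randn(100, 1)

w = np.zeros((2, 1))   # initial weights (bias, slope)
eta = 0.02             # learning rate
m = 10                 # mini-batch size

for iteration in range(1000):
    idx = rng.choice(100, size=m, replace=False)   # sample a mini-batch
    X_batch, y_batch = X_toy[idx], y_toy[idx]
    error = np.dot(X_batch, w) - y_batch           # (w^T x_i - y_i) for each sample
    gradient = np.dot(X_batch.T, error) / m        # average over the batch
    w = w - eta * gradient                         # the update rule from above
print(w.ravel())   # Approximately [1. 2.]
```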
In this and future notebooks we will be using the Python library Keras. This is a high-level library for building and training artificial neural networks running on either TensorFlow or Theano. We will be able to leverage Keras when creating our linear regression model with gradient descent because linear models can be interpreted as artificial neural networks.
Let's start by importing the modules we need from Keras as well as some additional ones we will use during training.
In [ ]:
import time
# A library for easily displaying progress meters
import tqdm
# Contains all built-in optimisation tools in Keras, such as stochastic gradient descent
from keras import optimizers
# An input "layer" and a densely-connected neural network layer
from keras.layers import Input, Dense
# Model is an API that wraps our linear regression model
from keras.models import Model
The input to our model is a single scalar value (Education). The output is also a single scalar value (Income).
In [ ]:
# There is only a *single* feature
input_X = Input(shape=(1,))
# The output of the model is a single value
output_y = Dense(units=1, use_bias=True)(input_X)
# We give the input and output to our Model API
model = Model(inputs=input_X, outputs=output_y)
# Print a summary of the model
model.summary()
Notice in the printed summary above how the fully-connected layer Dense() has two trainable parameters. One is the weight (slope), while the second is the bias (intercept). Keras adds bias units by default, but this can be turned off by setting use_bias=False.
The next thing we have to do in Keras is to set up an optimiser (sometimes called solver). There are many alternatives to select from, however, we will settle for the stochastic gradient descent algorithm we discussed earlier.
In [ ]:
#
# Start by setting some user options
#
# Learning rate (set very small so we can clearly see the training progress)
lr = 0.0001
# Number of times to apply the update rule
nb_iterations = 100
# Number of samples to include each iteration (used to compute gradients)
nb_samples = 30
# Create optimiser using Keras
sgd = optimizers.SGD(lr=lr)
# Add the optimiser to our model, make it optimise mean squared error
model.compile(optimizer=sgd, loss='mean_squared_error')
Now that both the model definition and the optimiser are set up we can start training. Training using the Keras model application programming interface (API) is done by calling the fit() method.
Don't worry too much if this code seems like a lot right now. We will get much more experience with using Keras throughout the upcoming notebooks.
While training the model, a plot is continuously updated to display the fitted line.
In [ ]:
fig, ax = plt.subplots(1, 1)
# Perform `nb_iterations` update rule applications
for i in tqdm.tqdm(np.arange(nb_iterations)):
    # Learn by calling the `fit` method
    model.fit(X_data, y_data,
              batch_size=nb_samples,
              epochs=1,
              verbose=0)
    # Make a plot of the data and the current fit
    xs = np.linspace(X_data.min(), X_data.max(), num=50)
    ys = model.predict(xs)
    ax.clear()
    ax.scatter(X_data, y_data, label='Training data')
    ax.plot(xs, ys, color='Red', linewidth=1, label='Fit')
    ax.set_xlabel('Education index')
    ax.set_ylabel('Income index')
    ax.grid(linestyle='dotted')
    ax.legend()
    fig.canvas.draw()
    time.sleep(0.05)
In [ ]: