Linear Regression Lab
February 17th 2015

Regression refers to the techniques employed when learning a mapping from a vector of input data to a quantitative output. For example, coordinates in a room to temperature, someone's Body Mass Index to their life expectancy, or a stock's performance over the last 5 days to its value tomorrow.

There are numerous methods for addressing this problem, each with their own set of assumptions and behaviour. Some are ideal for tackling large volumes of data while others provide more informative probabilistic outputs.

In this lab, we'll focus on Linear Regression: a good point of reference for many other regression techniques and still widely used today due to its simplicity and favourable scaling characteristics.

Part I: Maximum Likelihood Linear Regression

In this section, we're going to explore what linear regression models are and investigate how they can be trained using a "Maximum Likelihood" approach.


In [1]:
# Import related packages
import numpy as np
import scipy.linalg as linalg
import scipy.stats as stats
import matplotlib.pyplot as pl

#Set generated images to appear underneath the related cell
%matplotlib inline

First, let's generate some synthetic data. Here we create a random 1-D vector $\mathbf{x}$ of observation locations and evaluate an "unknown" ground truth function at these locations before adding some noise to create our observations, $\mathbf{y}$.


In [2]:
nObs = 20          # number of observations
betaInv = 0.2      # standard deviation of the Gaussian noise applied to the samples
x = np.sort(np.random.random(nObs)*10)[:,np.newaxis]  # observation locations
y = 0.5*x + 1 + 1*np.sin(x)+ (np.random.randn(nObs)*betaInv)[:,np.newaxis] # noisy observations of the ground truth function (which we don't know!)

Let's plot the data first to see what it looks like:


In [3]:
pl.figure(figsize=(15,5))
pl.plot(x,y,'b.')
pl.axis([0,10,0,7])
pl.xlabel('x')
pl.ylabel('y')
pl.show()


Linear Model with Polynomial Features

Okay, now let's try to fit a model to it.

One of the most straightforward types of regression is maximum likelihood regression with a linear model (the output is a weighted sum of the features). For now, let's use a very simple model with just two parameters (or weights), $\mathbf{w}$. \begin{equation} f(x) = w_0 + w_1 x \end{equation} This is just a line, with $w_0$ and $w_1$ representing the bias and slope, respectively.

In matrix form this is $\mathbf{f}= \mathbf{\theta w}$ where $\mathbf{\theta}$ is $[x^0,x^1]$ (the feature matrix) and $\mathbf{w}$ is $[w_0, w_1]^T$.

The feature matrix has a dimensionality of $N\times p$ where $N$ is the number of inputs and $p$ is the number of features. The complexity of the model can be increased by increasing $p$. Polynomial features (increasing powers of $x$) are commonly used in practice. Evaluating a polynomial model at locations $[x_1,...,x_N]$ and weights $[w_0,...,w_p]$ is given as:

\begin{equation} \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{bmatrix} = \begin{bmatrix} x_1^0 & x_1^1 & \cdots & x_1^{p-1} \\ x_2^0 & x_2^1 & \cdots & x_2^{p-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_N^0 & x_N^1 & \cdots & x_N^{p-1} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{p-1} \end{bmatrix} \end{equation}

Let's query our model and compare it against the data. To generate a polynomial feature matrix $\theta$, we'll use a custom function called polyFeatureGen() that takes the observation or query locations ($x$ or $x_{\text{query}}$) and the number of features ($p$) as inputs. It returns an $n \times p$ matrix such that: \begin{equation} \theta_{i,j} = x_i^{j-1} \end{equation}


In [4]:
# Define Polynomial basis functions -------------------
# x: input vector (n x d) where n is the number of inputs and d is their dimensionality (you can assume d=1 for this lab)
# p: number of features (scalar)
# theta: the feature matrix (n x p)

def polyFeatureGen(x,p):
    #Solution:
    theta = x**np.arange(p)
    return theta

In [5]:
# Test polyFeatureGen
# Run this cell. The output should be: 
# [[1  1  1]
#  [1  4 16]
#  [1  3  9]
#  [1  6 36]]

xTest = np.array([1,4,3,6])[:,np.newaxis] # 4 1-D inputs
pTest = 3 #This test model will have 3 features
thetaTest = polyFeatureGen(xTest,pTest) #the test feature matrix (should be 4 x 3)
print(thetaTest)


[[ 1  1  1]
 [ 1  4 16]
 [ 1  3  9]
 [ 1  6 36]]

For now, let's stick with two features. Let's pick two arbitrary values for the weights. Say,


In [6]:
p = 2 #number of features
w_guess = np.array([-0.2, 5])[:,np.newaxis]

Generate the query points and the associated feature matrix


In [7]:
nQuery = 1000 # number of query inputs
xQuery = np.linspace(0,10,nQuery)[:,np.newaxis] #linearly spaced from 0 to 10

Write a line of code to generate the associated feature matrix for the query points:


In [8]:
#Uncomment and complete the line of code below to generate thetaQuery, (the feature matrix for the query points):
#
#Solution:
thetaQuery = polyFeatureGen(xQuery,p)

print(thetaQuery) #Should be a nQuery x p matrix


[[  1.           0.        ]
 [  1.           0.01001001]
 [  1.           0.02002002]
 ..., 
 [  1.           9.97997998]
 [  1.           9.98998999]
 [  1.          10.        ]]

Now evaluate the model at the query locations


In [9]:
fQuery = thetaQuery.dot(w_guess)
print(fQuery.shape)


(1000, 1)

And plot the results:


In [10]:
fig = pl.figure(figsize=(15,5))
pl.plot(x,y,'b.')
pl.plot(xQuery,fQuery,'r')
pl.xlabel('x')
pl.ylabel('f(x)')
pl.show()


It's not a great fit! The Mean Squared Error is:


In [11]:
fModel = polyFeatureGen(x,p).dot(w_guess)
MSE = np.mean((fModel - y)**2)
print('Mean Squared Error: ' + str(MSE))


Mean Squared Error: 746.150265796

Maximum Likelihood Estimation

Now let's learn the parameters of our model to see if we can improve the fit. The goal here is to estimate the parameter values (i.e. $\mathbf{w}$) from the data. A popular way of doing this is by optimising the conditional likelihood of the data given the model, $p(y|x,w,\beta^{-1})$, with respect to the parameters.

A common assumption to make is that the measurements we gathered are subject to Gaussian noise, i.e. we assume that each observation $y_n$ is drawn from a Gaussian distribution with mean $\mathbf{\theta}_n \mathbf{w}$ (where $\mathbf{\theta}_n$ is the $n^{\text{th}}$ row of the feature matrix) and variance $\beta^{-1}$ (indicating the level of noise).

The probability of acquiring a particular set of observations, given a specific model, is \begin{equation} p(\mathbf{y}|\mathbf{x},\mathbf{w},\beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(y_n | \mathbf{\theta}_n \mathbf{w}, \beta^{-1}) \end{equation}

It is often referred to as the likelihood of the data. In log form: \begin{equation} \text{ln}\,p(\mathbf{y}|\mathbf{x},\mathbf{w},\beta^{-1}) = \sum_{n=1}^{N} \text{ln}\big( \mathcal{N}(y_n | \mathbf{\theta}_n \mathbf{w}, \beta^{-1})\big) \\ = \frac{N}{2} \text{ln}(\beta) - \frac{N}{2}\text{ln}(2\pi) - \frac{\beta}{2} \sum_{n=1}^{N} (y_n-\mathbf{\theta}_n \mathbf{w})^2 \end{equation}

The final term is the sum of squared errors between the observations and the model. If we vary the parameters of the model to maximise the likelihood of the data, we train a model that best fits the data. The gradient of the log likelihood with respect to the weights is: \begin{equation} \triangledown \text{ln}\,p(\mathbf{y}|\mathbf{x},\mathbf{w},\beta^{-1})= \beta \sum_{n=1}^{N} (y_n - \mathbf{\theta}_n \mathbf{w})\mathbf{\theta}_n \end{equation} Find the maximum by setting the gradient to zero and then solving for $\mathbf{w}$ to obtain the "maximum-likelihood parameters" (w_ml): \begin{equation} \mathbf{w}_{\text{ml}} = (\mathbf{\theta}^T\mathbf{\theta})^{-1}\mathbf{\theta}^T \mathbf{y} \end{equation}

Generate the feature matrix using the observation locations.


In [12]:
#Solution:
theta = polyFeatureGen(x,p)

Then find the parameters that maximise the likelihood as described above.


In [13]:
#Implement a line of code to determine w_ml. Hint: use np.linalg.solve(A,B) 
#rather than np.linalg.inv(A).dot(B) for better numerical stability

#Solution:
w_ml = np.linalg.solve((np.transpose(theta).dot(theta)),(np.transpose(theta))).dot(y)

print(w_ml)


[[ 0.41310195]
 [ 0.58601903]]
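
As a quick cross-check (a sketch, not part of the original exercise; it reuses theta, y and w_ml from above), NumPy's least-squares solver solves the same problem and should agree with the normal-equations solution:


In [ ]:
# np.linalg.lstsq minimises ||theta.w - y||^2 directly, which is equivalent to the
# maximum likelihood solution derived above.
w_lstsq = np.linalg.lstsq(theta, y, rcond=None)[0]
print(np.allclose(w_lstsq, w_ml))   # expect True (up to numerical precision)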

Now evaluate the model at the query locations using the new parameters and plot the results as before:


In [14]:
# Solution:
fQuery = thetaQuery.dot(w_ml)       # the model's predictions at the query locations. This will be an nQuery x 1 matrix

fig = pl.figure(figsize=(15,5))
pl.plot(x,y,'b.')
pl.plot(xQuery,fQuery,'r')
pl.axis([0, 10, 0, 8])
pl.xlabel('x')
pl.ylabel('f(x)')
pl.title('Max Likelihood weights with ' + str(p) + ' features')
pl.show()


Much better! Calculate and print the mean squared error between the model prediction and the observations using these new parameters, to quantitatively compare with our initial set of parameters.


In [15]:
#Solution
fModel = theta.dot(w_ml)                 # the model's predictions at the observation locations. This will be an N x 1 matrix
MSE = np.mean((fModel - y)**2)           # scalar
print('Mean Squared Error: ' + str(MSE))


Mean Squared Error: 0.392305925411

This MSE is the best we can do with this model.

Increasing the Model Complexity

To reduce the MSE further, we will need to increase the model's flexibility by adding higher order polynomial features. In the cell below, increase the number of features in the model, retrain it, plot the resulting query outputs as before and evaluate the MSE. What is the number of features that gives the lowest MSE?


In [16]:
# Set the number of features
p = 8 #number of features

# Build the design feature matrix
theta = polyFeatureGen(x,p)

# Determine the maximum likelihood weights
w_ml = np.linalg.solve((np.transpose(theta).dot(theta)),(np.transpose(theta))).dot(y)

# Build the feature matrix for the query points
thetaQuery = polyFeatureGen(xQuery,p)

# Evaluate the model at the query points
fQuery = thetaQuery.dot(w_ml)

fig = pl.figure(figsize=(15,5))
pl.plot(x,y,'b.')
pl.plot(xQuery,fQuery,'r')
pl.axis([0, 10, 0, 8])
pl.xlabel('x')
pl.ylabel('f(x)')
pl.title('Number of Features: '+str(p))
pl.show()

fModel = theta.dot(w_ml)
MSE = np.mean((fModel - y)**2)
print('Mean Squared Error: ' + str(MSE))


Mean Squared Error: 0.019448867593

As you may have noticed, increasing the flexibility and complexity of the model decreases the mean squared error, because the model's ability to fit all of the observations exactly increases as well. However, it also results in "overfitting": the model fits the observations very well but generalises poorly to unseen locations. If a number of observations were withheld from training, the model would not do well at predicting their values, as it is more concerned with fitting the observations it can see.
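
To see this in action, the sketch below (not part of the original lab; it reuses x, y, nObs and polyFeatureGen, and holds out every fourth observation) sweeps the number of polynomial features: the training MSE keeps falling as p grows, while the held-out MSE typically starts rising once the model becomes too flexible.


In [ ]:
# Hold out every 4th observation and sweep the number of polynomial features.
trainIdx = np.ones(nObs, dtype=bool)
trainIdx[::4] = False                       # every 4th point is withheld from training
xTrain, yTrain = x[trainIdx], y[trainIdx]
xHeld,  yHeld  = x[~trainIdx], y[~trainIdx]

for pTrial in range(2, 11):
    thetaTrain = polyFeatureGen(xTrain, pTrial)
    # least squares solution (equivalent to the maximum likelihood weights)
    wTrial = np.linalg.lstsq(thetaTrain, yTrain, rcond=None)[0]
    mseTrain = np.mean((thetaTrain.dot(wTrial) - yTrain)**2)
    mseHeld  = np.mean((polyFeatureGen(xHeld, pTrial).dot(wTrial) - yHeld)**2)
    print(pTrial, mseTrain, mseHeld)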

Other Types of Features (also known as Basis Functions)

Up until now, the feature matrix has been generated using polynomial basis functions:

\begin{equation} \theta_{i,j} = x_i^{j-1} \end{equation}

Run the cell below to visualise the basis functions that were used in the last model you trained:


In [17]:
pl.figure(figsize=(15,5))
for i in range(p):
    pl.plot(xQuery,thetaQuery[:,i])
    pl.axis([0, 10, 0, 80])
    pl.title('Individual basis functions')
pl.show()


The above graph is simply plotting each column or feature of the feature matrix.

Multiply each basis function by its associated learnt weight, w_ml[i], and plot the results.


In [18]:
pl.figure(figsize=(15,5))
for i in range(p):
    pl.plot(xQuery,thetaQuery[:,i]*w_ml[i])
    pl.axis([0, 10, -80, 80])
    pl.title('Individual weighted basis functions')
pl.show()
print('Learnt weights: ' + str(w_ml.transpose()))


Learnt weights: [[  6.94876283e+00  -2.00026847e+01   2.04606702e+01  -9.13155568e+00
    2.10026572e+00  -2.59423463e-01   1.63943566e-02  -4.17155772e-04]]

Interestingly, the model's complicated looking prediction is just a linear combination of these basis functions. Hence the name "linear" regression.
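
A quick check of this statement (reusing thetaQuery, w_ml and p from above):


In [ ]:
# The prediction really is just the weighted sum of the individual basis functions.
fSum = sum(w_ml[i, 0]*thetaQuery[:, [i]] for i in range(p))
print(np.allclose(fSum, thetaQuery.dot(w_ml)))   # expect True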

However, there are many possible variants of basis functions each giving the resulting model different characteristics. Radial basis functions are a commonly used alternative to the polynomial function: \begin{equation} \theta_{i,j} = \text{exp}\Bigg(-\frac{(x_i-\mu_j)^2}{2s^2}\Bigg) \end{equation}

Here, $\mu_j$ is the mean (centre) of the $j^{\text{th}}$ radial basis function and $s$ corresponds to the basis function's width or region of influence.

Implement the basis function radialFeatureGen() that takes a vector of inputs ($\textbf{x}$), a vector of $p$ mean locations ($\mathbf{\mu}$) with shape $p \times 1$ and a basis function spatial width ($s$). It then calculates the feature matrix, $\theta$, using the radial basis function given above.


In [19]:
# Define Radial basis functions -------------------
# Dimensions:
# x: n x 1
# mu: p x 1
# s: 1 x 1
# theta: n x p

def radialFeatureGen(x,mu, s):
    # Code goes here
    #Solution:
    theta = np.exp(-(((x-mu.transpose())**2)/(2*s**2)))   
    return theta
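
A quick sanity check, in the same spirit as the polyFeatureGen test above (this cell is a sketch; the expected values are computed directly from the radial basis formula with $s=1$):


In [ ]:
# Test radialFeatureGen on a tiny example
xTest = np.array([0., 1., 2.])[:,np.newaxis]   # 3 1-D inputs
muTest = np.array([0., 1.])[:,np.newaxis]      # 2 basis function centres
print(radialFeatureGen(xTest, muTest, 1.0))
# Expected (approximately):
# [[ 1.     0.607]
#  [ 0.607  1.   ]
#  [ 0.135  0.607]]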

Using the values for $\mu$ and $s$ provided below, write some code that finds the weights with the maximum likelihood ($w_{ml}$) for a design matrix generated using the radial basis function. Then evaluate the model at the query locations and plot the results as before.


In [20]:
mu = np.array([1,2,3,4,5])[:,np.newaxis] #The mean of the radial basis functions used in this model
s = 1 # the spatial width of the basis function

#Solution:
theta = radialFeatureGen(x,mu,s)
w_ml = np.linalg.solve((np.transpose(theta).dot(theta)),(np.transpose(theta))).dot(y)
thetaQuery = radialFeatureGen(xQuery,mu,s)

# Evaluate the model at the query points
fQuery = thetaQuery.dot(w_ml)

# Plot the results:
fig = pl.figure(figsize=(15,5))
pl.plot(x,y,'b.')
pl.plot(xQuery,fQuery,'r')
pl.axis([0, 10, 0, 8])
pl.xlabel('x')
pl.ylabel('f(x)')
pl.title('Radial Basis Function')

pl.figure(figsize=(15,5))
for i in range(mu.shape[0]):
    pl.plot(xQuery,thetaQuery[:,i][:,np.newaxis]*w_ml[i])
    pl.title('Individual weighted basis functions')
pl.show()
print('Learnt weights: ' + str(w_ml.transpose()))


Learnt weights: [[  5.20689104 -10.95883545  14.04892152 -10.13378618   7.5608242 ]]

Rerun the above cell but change the basis functions by altering $\mu$. e.g. mu = np.array([1,2,3,4,5,6,7,8,9,10])[:,np.newaxis]

  • What effect does changing $s$ have on the model? (see the sketch below)
  • How might we find the optimal values for $\mu$ and $s$?
  • How does the behaviour of the radial basis function model differ from the polynomial case in regions where data is sparse?
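
To get a feel for the first question, here is a small sketch (reusing x, y, xQuery and mu from above, and refitting with np.linalg.lstsq, which gives the same maximum likelihood weights) that fits the radial basis model for a few different widths $s$:


In [ ]:
pl.figure(figsize=(15,5))
pl.plot(x,y,'b.')
for sTrial in [0.5, 1, 3]:
    thetaS = radialFeatureGen(x, mu, sTrial)
    wS = np.linalg.lstsq(thetaS, y, rcond=None)[0]   # maximum likelihood weights
    pl.plot(xQuery, radialFeatureGen(xQuery, mu, sTrial).dot(wS), label='s = '+str(sTrial))
pl.axis([0, 10, 0, 8])
pl.legend()
pl.title('Effect of the basis function width s')
pl.show()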

Downsides to this Approach

While this approach to regressing over data is appealing, it does have a few shortcomings. For example, we have no estimate of how confident the model is in its predictions (which is crucial in decision making applications). Also, the model learns a single hypothesis given the data; in reality there may be numerous explanations for the same set of observations. In Part II of today's lab, we will investigate a probabilistic approach to regression.

$$ % Bayesian Linear Regression Tutorial, Alistair Reid, NICTA 2015 \newcommand{\W} [0] {\mathbf{w}} \newcommand{\X} [0] {\mathbf{x}} \newcommand{\y} [0] {\mathbf{y}} \newcommand{\Xf} [0] {\mathbf{\theta}} \newcommand{\norm} [0] {\mathcal{N}} \newcommand{\weightprec} [0] {Q} \newcommand{\weightprecsimp} [0] {\alpha I} \newcommand{\alphs} [0] {\mathbf{\alpha}} \newcommand{\invnoise} [0] {\beta} \newcommand{\noise} [0] {\beta^{-1}} \newcommand{\Ainv} [0] {A^{-1}} \newcommand{\fq} [0] {\mathbf{f_*}} \newcommand{\xq} [0] {\mathbf{x_*}} $$

Part 2: Bayesian Linear Regression

In the previous sections, we have selected the linear regression weights by maximising the likelihood of the data. The learnt model certainly makes useful predictions, but it provides no notion of uncertainty: we cannot distinguish between a prediction that is tightly constrained by the data, and a prediction that is essentially an unconstrained `guess'. Also, if there are multiple likely explanations of the data, we have neglected all but one. Recall that for a given feature mapping $\X \to \Xf$, the predictive behaviour of a linear regression is encoded in the weights: \begin{equation} f(x) = \Xf \W \end{equation}

Where $\W$ is $p \times 1$ and $\Xf$ is $n \times p$. Instead of using a single maximum likelihood weight vector, we can take a Bayesian approach and consider a probability distribution over all possible weight vectors. The distribution can be inferred from the observations using Bayes rule:

\begin{equation} p(\W|\y) \propto p(\y|\W) p(\W) \end{equation}

The likelihood $p(\y|\W)$ is exactly the criterion we were using to select the maximum likelihood weights:

\begin{equation} p(\y | \W) = \norm (y | \Xf \W, \noise I ) \end{equation}

In addition, we must now specify a prior distribution over $\W$ encoding our belief about reasonable weight values. Any probability distribution we consider to be sensible for the problem could be used here, but some are easier to work with than others. In this tutorial, we will use a simple Gaussian prior with precision (inverse covariance) $\weightprecsimp$ that is $p \times p$ with a single free parameter:

\begin{equation} \W \sim \norm( 0, \alphs^{-1}I ) \end{equation}

Hence we are assuming the weights are independently distributed and penalised for large magnitudes. While the $\W$'s are model parameters that directly affect the output, we consider the $\alphs$'s to be hyperparameters because they tune the prior distribution over the parameters $\W$.
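
To see what this prior means in function space, the sketch below (illustrative only; it reuses polyFeatureGen from Part I, assumes two features and uses a precision of alpha = 0.25, the value chosen a few cells further down) draws a handful of weight vectors from the prior and plots the corresponding lines:


In [ ]:
alphaDemo = 0.25                                     # assumed prior precision for this sketch
xDemo = np.linspace(0, 10, 100)[:,np.newaxis]
thetaDemo = polyFeatureGen(xDemo, 2)                 # features (1, x) -> intercept and slope
pl.figure(figsize=(5,5))
for i in range(10):
    wDraw = np.random.randn(2,1)/np.sqrt(alphaDemo)  # w ~ N(0, alpha^-1 I)
    pl.plot(xDemo, thetaDemo.dot(wDraw))
pl.title('Lines drawn from the weight prior')
pl.xlabel('x')
pl.ylabel('f(x)')
pl.show()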

Because we have assumed a Gaussian form for both the likelihood and the prior, Bayes rule can be computed analytically in this case to obtain a posterior distribution over the weights. This involves expanding the exponents of the two Gaussians and completing the square (an exercise left to the reader).
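
For those who want to check their working, the key step is collecting the terms in the log posterior that are quadratic and linear in $\W$:

\begin{equation} \log p(\W|\Xf,\y) = -\frac{\invnoise}{2}(\y - \Xf\W)^\top(\y - \Xf\W) - \frac{1}{2}\W^\top \weightprecsimp \W + \text{const} = -\frac{1}{2}\W^\top \left( \invnoise \Xf^\top\Xf + \weightprecsimp \right) \W + \invnoise \W^\top \Xf^\top \y + \text{const} \end{equation}

Matching this to the exponent of a Gaussian identifies the precision and mean given below.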

The posterior over the weights is given by: \begin{equation} p(\W|\Xf,y) \sim \norm \left( \mu_w, \Ainv \right) \end{equation} Where $A$ is the $p \times p$ posterior precision given by: \begin{equation} A = \invnoise \Xf^\top \Xf + \weightprecsimp \end{equation} And the weight means are given by: \begin{equation} \mu_w = \invnoise \Ainv \Xf^\top \y \end{equation}

Exercise: Compute Posterior Weight Distribution


In [21]:
import numpy as np
import scipy.linalg as linalg
import matplotlib.pyplot as pl
import scipy.optimize # In addition, nlopt would be a great optimisation library to consider
from math import log, pi
%matplotlib inline

In [22]:
# Generate Noisy Data ---------------------------------
nSamp = 5
noiseSTD = 0.5
xdomain = 10
x = np.sort(np.random.random(nSamp)*xdomain)[:,np.newaxis] # x is n by 1
y = x + np.random.randn(nSamp)[:,np.newaxis]*noiseSTD - 5  # y is also n by 1

pl.figure(figsize=(5,5) )
pl.plot(x,y,'k.')
pl.grid()
pl.title('Noisy observations of the linear target function')
pl.xlabel('x')
pl.ylabel('y')
pl.show()



In [23]:
# Use 2 features - (1, x) to learn intercept and slope -------------------
p = 2    # number of features
n = len(x)

# build a feature vector n*[1, x]
theta = polyFeatureGen(x,p)   

# Define a prior
alpha = 2**-2           # Weight precision -> larger alpha tightens the prior (stronger complexity penalty) - try it
beta  = noiseSTD**-2     # noise precision (inverse variance) - cheat and use the real value

# Compute the posterior
A = beta * theta.T.dot(theta) + alpha * np.eye(p)
mu_w = beta * linalg.solve(A, theta.T.dot(y))

# Plot a contour ellipse of the weight distribution
th = np.linspace(0,2*pi,100)
xy = mu_w - np.log(0.1/2)*linalg.cholesky(linalg.inv(A),lower=False).dot(np.vstack((np.sin(th), np.cos(th))))

postpl = pl.plot(xy[0], xy[1],'b')
xy = np.zeros(mu_w.shape) - np.log(0.1/2)*np.vstack((np.sin(th), np.cos(th)))/np.sqrt(alpha)
pripl = pl.plot(xy[0], xy[1],'g')


pl.xlabel('Intercept')
pl.ylabel('Gradient')
pl.title('Weights (prior=green, posterior=blue) - contour at 0.1 max probability')
pl.axis('equal')
pl.show()


Marginalising over weights

While it is interesting to see the distribution of weights used by the Bayesian linear regression, the ultimate goal is to predict the outputs $\fq$ corresponding to inputs $\xq$. To do this, we marginalise out the weights $\W$:

\begin{equation} p(\fq | \xq, \Xf, \y) = \int p(\fq|\xq,\W) p(\W|\Xf,\y) d\W \end{equation}

We know from above that: \begin{equation} p(\W|\Xf,y) \sim \norm ( \mu_w, \Ainv ) \end{equation}

We have also defined a deterministic mapping to the predictive function: \begin{equation} \fq = \xq \W \end{equation}

This marginalisation has a closed form solution - we can produce predictive distributions, considering all possible weights, without explicitly computing them: \begin{equation} \fq \sim \norm \left( \invnoise \xq \Ainv \Xf^\top \y, \hspace{0.5cm} \xq \Ainv \xq^\top \right) \end{equation}

Note that this equation gives a multivariate Gaussian prediction: for $m$ queries, the covariance will be $m \times m$. This is useful for joint draws. However, if you are only interested in the envelope of uncertainty, then the diagonal can be extracted to give the marginal variance of each prediction. Don't forget to add the noise variance back on if you want to predict the range of possible observations rather than latent functions!

Exercise:


In [24]:
# Make predictions directly from the data-------------------
nQuery = 1000
xQuery = np.linspace(-1, 11, nQuery)[:,np.newaxis]
thetaQuery = polyFeatureGen(xQuery,p)

# lets make this a reusable function
def linreg(theta, y, thetaQuery, alpha, beta):
    I = np.eye(theta.shape[1])
    A = beta * theta.T.dot(theta) + alpha*I
    mu_w = beta * linalg.solve(A, theta.T.dot(y))
    f_mean = thetaQuery.dot(mu_w)

    
    #f_covar = thetaQuery.dot(linalg.solve(A, thetaQuery.T))+1/beta
    #f_std = np.sqrt(np.diag(f_covar))[:,np.newaxis]
    
    # Advanced challenge: use cholesky factorisation to compute the marginal variance directly
    L = linalg.cholesky(A, lower=True)  # Factorise A for faster computation
    f_std = np.sqrt( np.sum( linalg.solve(L, thetaQuery.T )**2, axis=0)  + 1./beta)[:,np.newaxis] # try with and without noise variance
    
    #f_std = np.sqrt(np.diag(f_covar))
    return f_mean, f_std

f_mean, f_std = linreg(theta, y, thetaQuery, alpha, beta)

# plot the mean, and +- 2.5 standard deviations
pl.figure(figsize=(5,5) )
pl.plot(x,y,'k.')
pl.plot(xQuery, f_mean, 'b')
pl.plot(xQuery, f_mean+2.5*f_std, 'r')
pl.plot(xQuery, f_mean-2.5*f_std, 'r')
pl.xlabel('Input x')
pl.ylabel('Output y')
pl.title('Predictive Envelope with Data and Mean')
pl.show()


Exercise: Non-linear Basis Functions

We have already been using simple linear basis functions, but let's try a demo with a non-linear problem. Don't worry about visualising the distribution over weights - with many polynomial features they will be too high dimensional to show, but we will get very interesting predictive marginals:


In [25]:
# Generate Noisy Sinusoid Data ---------------------------------
nSamp = 30
noiseSTD = 0.1
xdomain = 10
x = np.sort(np.random.random(nSamp)) 
x = x-np.min(x)
x = x/np.max(x) * xdomain
y = np.sin(x) + np.random.randn(nSamp)*noiseSTD
x = x[:,np.newaxis]
y = y[:,np.newaxis]


pl.figure(figsize=(15,5) )
pl.plot(x,y,'k.')
pl.xlabel('x')
pl.ylabel('y')
pl.title('Noisy observations of the target function')
pl.grid()
pl.show()


This data clearly has a non-linear pattern. As we saw in Part 1 of this tutorial, linear regression techniques can perform well on non-linear data by introducing basis functions. Let's apply the polynomial basis functions you implemented in polyFeatureGen to increase the dimensionality of the inputs to $p=7$ and run the regression again.


In [26]:
# Bayesian linear regression - prediction with polynomial basis functions
# Define prediction locations -------------------
nQuery = 1000
xQuery = np.linspace(0, 10, nQuery)[:,np.newaxis]

# transform into feature space (now n by p)
p = 7   # number of features
theta = polyFeatureGen(x,p)
thetaQuery =  polyFeatureGen(xQuery,p)

# Define a prior
alpha = 0.7**-2 # weight precision (inverse variance) - chosen automatically using the strategy in the final section
beta  = 0.1**-2 # noise precision (inverse variance) - using the real value for now

f_mean, f_std = linreg(theta, y, thetaQuery, alpha, beta)

# plot the mean, and +- 2.5 standard deviations
pl.figure(figsize=(15,5) )
pl.plot(x,y,'k.')
pl.plot(xQuery, f_mean, 'b')
pl.plot(xQuery, f_mean+2.5*f_std, 'r')
pl.plot(xQuery, f_mean-2.5*f_std, 'r')
pl.xlabel('Input x')
pl.ylabel('Output y')
pl.title('Predictive Envelope with Data and Mean')
pl.show()


Now the predictions are able to capture the non-linear structure of this problem! There are two ways of interpreting how this is working.

The first interpretation is that we are inferring a probability distribution over the weights, and each weight corresponds to a basis function. With the polynomial basis functions, we are therefore considering distributions over the polynomial coefficients in a polynomial equation. You can see in the predictive distribution: \begin{equation} \fq \sim \norm \left( \invnoise \xq \Ainv \Xf^\top \y, \hspace{0.5cm} \xq \Ainv \xq^\top \right) \end{equation} that our mean function is in fact a linear combination of the basis functions composing $\Xf$, and the space of models is restricted to the span of the basis functions.

A second interpretation is that we have used polyFeatureGen to project the inputs and the queries into 7 dimensions, effectively warping the space on the x-axis. The line in the high dimensional space can then project to a nonlinear function in the original space. To understand this principle, think about how a direct path an airline takes around a 3D sphere can appear as a curved arc when projected onto a 2D map. You will see this principle applied in the support vector machine tutorial.

You can investigate the choice of basis functions by modifying the above code:

  • What happens to the prediction as you move away from the input data? (try changing the domain above)
  • We have been using polynomial basis functions. Can you think of a different class of basis functions that might work for periodic data?

Plotting draws from the posterior

We have been looking at the mean and envelope of the predictive distribution, but it is also useful to look at draws from the posterior distribution. To do this we must look at the joint multivariate distribution of the posterior. Note that given our particular choice of Gaussian prior and likelihood, the posterior predictions are also joint-Gaussian. For example, we can compute a posterior multivariate Gaussian over the weights and draw random weights out of this distribution. This will sample a set of deterministic models each drawn with probability consistent with our Bayesian analysis.

How do we draw values out of a multivariate Gaussian distribution? We can do this by linear operations on a vector of independent univariate Gaussian draws. This can be achieved using the Cholesky factorisation: if $L$ is the lower Cholesky factor of the covariance matrix $A^{-1}$ (so that $A^{-1} = LL^\top$), and $\mathbf{\Sigma}$ is a vector of iid standard Gaussian draws, we may draw from the distribution according to: \begin{equation} \hat{\W} = \mu_w + L \Sigma \end{equation}


In [27]:
# Plotting draws from the function
I = np.eye(theta.shape[1])
A = beta * theta.T.dot(theta) + alpha*I
mu_w = beta * linalg.solve(A, theta.T.dot(y))
L = linalg.cholesky(linalg.inv(A), lower=True)

nPlots = 10
fig = pl.figure(figsize=(15,5))
pl.plot(x,y,'k.')
for i in range(nPlots):
    weights = mu_w + L.dot(np.random.randn(p)[:,np.newaxis])
    f_star = thetaQuery.dot(weights)
    pl.plot(xQuery, f_star)
pl.xlabel('x')
pl.ylabel('y')
pl.title('Draws from the posterior')
pl.show()


  • If we were to average over a very large number of these draws, the density at each input value would be Gaussian distributed with the marginals from the previous section.
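
A quick empirical check of this claim (a sketch reusing mu_w, L, A, thetaQuery and p from the cell above; note it compares against the latent, noise-free marginal standard deviation, since the f_std returned by linreg has the noise variance added):


In [ ]:
nDraws = 5000
W = mu_w + L.dot(np.random.randn(p, nDraws))       # p x nDraws weight samples from the posterior
F = thetaQuery.dot(W)                              # nQuery x nDraws function draws
empMean = np.mean(F, axis=1)
empStd  = np.std(F, axis=1)

# Analytic latent marginals: mean = thetaQuery.mu_w, var = diag(thetaQuery A^-1 thetaQuery^T)
latentMean = thetaQuery.dot(mu_w)[:,0]
Lc = linalg.cholesky(A, lower=True)
latentStd = np.sqrt(np.sum(linalg.solve(Lc, thetaQuery.T)**2, axis=0))

print(np.max(np.abs(empMean - latentMean)))        # should shrink towards 0 as nDraws grows
print(np.max(np.abs(empStd - latentStd)))          # should shrink towards 0 as nDraws grows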

Advanced: Hyperparameter selection

Until now we have been manually specifying values for $\alphs^2$ and $\invnoise$. Such manual selection is not ideal, as it requires a deep understanding of every dataset and large amounts of manual tuning to select appropriate noise and prior hyperparameters. Instead, there are two main approaches for hyperparameter selection based on the data. Both rely on the concept of marginal likelihood - the probability of drawing the observations out of the prior. We can obtain this criterion by marginalising out the actual weight vectors and noise draws. The log marginal likelihood, given all our Gaussianity assumptions so far, is given by:

\begin{equation} \log p(y | \Xf) = \frac{1}{2} \left( p \log \alphs^2 + n \log \invnoise - \log |A| - n \log 2\pi - \invnoise (y-y^*)^\top (y-y^*) - \alphs^2 \mu^{*\top} \mu^* \right) \end{equation}\begin{equation} \mu^* = \invnoise \Ainv \Xf^T y \end{equation}\begin{equation} \y^* = \Xf \mu^* \end{equation}

See "Evaluation of the evidence function" in Bishop - Pattern recognition and machine learning for how this is derived. Here $|A|$ is the determinant of A, and $\mu^*$ and $\y^*$ are the mean weights and mean latent function predictions at $\X$ respectively using the current hyperparameters. This criterion naturally weights model fit against overfitting. If the prior is too tight to fit the data, we cannot draw the observations, but if it is too general then the probability of drawing any particular dataset is also low. It can become very small or large and is best computed in log form.

The first hyperparameter strategy is to define a hyper-prior over $\alphs$ and $\mathbf{\beta}$. If we carefully choose a conjugate prior (for our scenario this will be an inverse gamma distribution) over these parameters, then it is possible to marginalise over the hyperparameters (in the same way we marginalised the weights out of the predictive distribution) and obtain a closed form solution for the weight distribution. The interested student is directed to Bishop, Pattern Recognition and Machine Learning. Unfortunately, this distribution will be of Student-t form rather than Gaussian, so marginalising out the weights to obtain predictions will then be a (straightforward) exercise in numerical integration.

A simpler approach, which we will take here, is to search for the maximum likelihood hyperparameters - this is also called the evidence approximation or type 2 maximum likelihood. The main drawback of this approach is that we lose a level of Bayesian inference compared to the above alternative. Also, we must select these parameters via optimisation strategies and the problem is non-convex, meaning we may end up in a local optimum that is better than its neighbours but not the best set of parameters. However, it is a more principled method of model selection than manually tweaking the parameters.

Below we will use a simple gradient-free numerical optimisation procedure to find some hyperparameters:


In [28]:
# Compute the log marginal likelihood
def nLML(params, theta, y, verbose=True):

    alpha_ = params[0]**2
    beta_  = params[1]**2
    p_ = theta.shape[1]
    I = np.eye(p_)
    
    if verbose:
        print('Querying:', alpha_, beta_)
    
    A = alpha_*I + beta_*np.dot(theta.T, theta) 
    
    # The log determinant of a positive definite matrix is twice the sum of the logs of the diagonal of its Cholesky factor.
    L = linalg.cholesky(A, lower=True)
    mn = beta_*linalg.solve(A, np.dot(theta.T,y))
    predY = np.dot(theta, mn)
    LML = 0.5* (p_*log(alpha_) + nSamp*log(beta_) - 2*np.sum(np.log(np.diag(L))) - nSamp*log(2*pi)
                - beta_*np.sum((y - predY)**2) - alpha_*np.sum(mn**2))
    return -LML
  
params0 = [1., 1.] # signal, noise inverse standard deviations
# Note we haven't defined gradients (which might be a good idea later), so we will use
# the gradient-free COBYLA algorithm
nLMLcriterion = lambda params: nLML(params, theta, y)
newparams = scipy.optimize.minimize(nLMLcriterion, params0, method='COBYLA') # gradient free
alpha = newparams.x[0]**2
beta  = newparams.x[1]**2
print('Local Optimum:\n alpha = ',alpha, ', beta=', beta)


Querying: 1.0 1.0
Querying: 4.0 1.0
Querying: 4.0 4.0
Querying: 4.76828283359 8.89825082184
Querying: 9.3754577175 11.9792744431
Querying: 8.17739691358 19.7173997388
Querying: 8.76943353114 29.5418573547
Querying: 4.51310030042 35.7911663426
Querying: 4.34387958267 48.7450174316
Querying: 2.33679570135 61.0469284401
Querying: 3.75265562622 65.6350340256
Querying: 5.46823905781 70.5578503581
Querying: 5.23109951122 79.1608227278
Querying: 7.54717235986 82.6839907651
Querying: 4.15093387108 79.3493677075
Querying: 5.66617807396 80.649749649
Querying: 5.34915052014 76.9988629341
Querying: 4.95181236795 79.006260824
Querying: 4.67758241391 78.9920400065
Querying: 4.93249324811 79.5573563944
Querying: 4.72357053903 78.3881352714
Querying: 4.83701833308 78.6968946056
Querying: 4.85639456883 78.148931342
Querying: 4.83958883157 77.6015171158
Querying: 4.97443783932 77.4768912017
Querying: 4.76869842271 77.1311127118
Querying: 4.80407824714 77.3661361532
Querying: 4.87288733784 77.566486446
Querying: 4.90586755063 77.6074989974
Querying: 4.91916930582 77.4804608515
Querying: 4.93559167424 77.5025583587
Querying: 4.90329080482 77.4530510314
Querying: 4.88758922683 77.4243006346
Querying: 4.90673937187 77.4215257063
Querying: 4.88638391764 77.4387567465
Querying: 4.89483371143 77.4459037241
Querying: 4.89371708667 77.4625089145
Querying: 4.89547424909 77.4289063766
Querying: 4.89526940764 77.4117403783
Querying: 4.89333763665 77.4276326351
Querying: 4.89979206585 77.4296960681
Querying: 4.90404866102 77.4266673638
Querying: 4.90192013242 77.4281817086
Querying: 4.89960159349 77.4254667818
Querying: 4.89784728787 77.4334461373
Querying: 4.8988196286 77.4315710913
Querying: 4.89850635761 77.4356833082
Querying: 4.89743108325 77.4361106622
Querying: 4.89891014643 77.4364044954
Querying: 4.89925228731 77.4375212641
Querying: 4.89954892942 77.4362148735
Querying: 4.89904431399 77.4390749229
Local Optimum:
 alpha =  4.89904431399 , beta= 77.4390749229

Now try using these values in the previous sections!
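
For example (a sketch reusing theta, thetaQuery, x, y, xQuery and linreg from above, together with the alpha and beta just found), the predictive envelope for the sinusoid data with the optimised hyperparameters can be re-plotted directly:


In [ ]:
f_mean, f_std = linreg(theta, y, thetaQuery, alpha, beta)

pl.figure(figsize=(15,5))
pl.plot(x,y,'k.')
pl.plot(xQuery, f_mean, 'b')
pl.plot(xQuery, f_mean+2.5*f_std, 'r')
pl.plot(xQuery, f_mean-2.5*f_std, 'r')
pl.xlabel('Input x')
pl.ylabel('Output y')
pl.title('Predictive Envelope with optimised hyperparameters')
pl.show()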

Part III: Stopping Distance Assignment

Now that you know the basics of linear regression, let's apply it to a real data problem. The cell below loads a dataset of stopping distances (metres) of a car travelling at different speeds (km/h). If you were in the same car travelling at 21 km/h and you hit the brakes at a distance of 135 metres from a wall, what is the probability of you coming to a stop before hitting the wall?


In [29]:
#Load data
data = np.load('cars_stopping_dist.npz')
speed= data['speed']
stoppingDist= data['dist']

In [30]:
#Solution

# plot raw data
pl.figure(figsize=(15,5) )
pl.plot(speed,stoppingDist,'k.')
pl.xlabel('Input x')
pl.ylabel('Output y')
pl.title('Raw Data')
pl.show()

#whiten data (here we simply scale each variable by its standard deviation)
speed_std = np.var(speed)**0.5
dist_std = np.var(stoppingDist)**0.5
speed_whitened = speed/speed_std
dist_whitened = stoppingDist/dist_std



In [31]:
# Train the model

nQuery = 1000
speedQuery = np.linspace(0, np.max(speed_whitened)*1.2, nQuery)[:,np.newaxis]  # Generates query locations

# transform into feature space using radial basis functions
# (the number of features is set by the number of centres in mu below)

mu = np.linspace(0,np.max(speedQuery),10)[:,np.newaxis] #The mean of the radial basis functions used in this model
s = 5.5 # the spatial width of the basis function

#Solution:
theta = radialFeatureGen(speed_whitened,mu,s)
thetaQuery = radialFeatureGen(speedQuery,mu,s)

params0 = np.array([0.7**-2, 0.3**-2 ]) # initial guess (nLML squares these to obtain alpha and beta)
# Note we haven't defined gradients (which might be a good idea later), so we will use
# the gradient-free COBYLA algorithm
nLMLcriterion = lambda params: nLML(params, theta, dist_whitened, verbose=False)
newparams = scipy.optimize.minimize(nLMLcriterion, params0, method='COBYLA') # gradient free

# Define a prior
alpha = newparams.x[0]**2
beta  = newparams.x[1]**2

# Query the model 
f_mean, f_std = linreg(theta, dist_whitened, thetaQuery, alpha, beta)

#Unwhiten
f_mean_unwhite = f_mean*dist_std
f_std_unwhite = f_std*dist_std
speedQuery_unwhite = speedQuery*speed_std

In [32]:
# plot the mean, and +- 2.5 standard deviations
pl.figure(figsize=(15,5))
pl.plot(speed,stoppingDist,'k.')
pl.plot(speedQuery_unwhite, f_mean_unwhite, 'b')
pl.plot(speedQuery_unwhite, f_mean_unwhite+2.5*f_std_unwhite, 'r')
pl.plot(speedQuery_unwhite, f_mean_unwhite-2.5*f_std_unwhite, 'r')
pl.grid()
pl.xlabel('Input x')
pl.ylabel('Output y')
pl.title('Predictive Envelope with Data and Mean')
pl.show()



In [33]:
#Query the model at the speed of interest (21km/h)

speedQuery = 21/speed_std #(whitened)

thetaQuery = radialFeatureGen(speedQuery,mu,s)

# Query the model 
f_mean, f_std = linreg(theta, dist_whitened, thetaQuery, alpha, beta)

#Unwhiten
f_mean_unwhite = f_mean*dist_std
f_std_unwhite = f_std*dist_std
speedQuery_unwhite = speedQuery*speed_std

In [34]:
# Determine the probability mass to the left of 135 metres in a 
# normal distribution with mean: predStoppingDistMean and standard deviation: predStoppingDistStd 
predStoppingDistMean = f_mean_unwhite
predStoppingDistStd = f_std_unwhite
distToWall = 135
prob = stats.norm.cdf(distToWall,predStoppingDistMean,predStoppingDistStd)

In [35]:
print('Probability of stopping before hitting the wall: '+ str(prob[0,0]))


Probability of stopping before hitting the wall: 0.159299284463