Changes: v.1.0 - First version. Python version
v.1.1 - Python 3 compatibility. ML section.
v.1.2 - Revised content. 2D visualization removed.
Pending changes:
In [1]:
# Import some libraries that will be necessary for working with data and displaying plots
# To visualize plots in the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import scipy.io # To read matlab files
from scipy import spatial
import pylab
pylab.rcParams['figure.figsize'] = 8, 5
In this exercise the student will review several key concepts of Maximum Likelihood and Bayesian regression. To do so, we will assume the regression model
$$s = f({\bf x}) + \varepsilon$$
where $s$ is the output corresponding to input ${\bf x}$, $f({\bf x})$ is an unobservable latent function, and $\varepsilon$ is white zero-mean Gaussian noise, i.e.,
$$\varepsilon \sim {\cal N}(0,\sigma_\varepsilon^2).$$
In addition, we will assume that the latent function is linear in the parameters
$$f({\bf x}) = {\bf w}^\top {\bf z},$$
where ${\bf z} = T({\bf x})$ is a possibly non-linear transformation of the input. Throughout this notebook, we will explore different types of transformations.
Also, we will assume an a priori distribution for ${\bf w}$ given by
$${\bf w} \sim {\cal N}({\bf 0}, \sigma_p^2~{\bf I}).$$
Though sometimes unavoidable, it is recommended not to use explicit matrix inversion whenever possible. For instance, if an operation like ${\mathbf A}^{-1} {\mathbf b}$ must be performed, it is preferable to code it using the python $\mbox{numpy.linalg.lstsq}$ function (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html), which provides the LS solution to the overdetermined system ${\mathbf A} {\mathbf w} = {\mathbf b}$.
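As a quick illustration of this advice, the snippet below compares both approaches on a small random system (the matrix A and vector b are placeholders made up for this example, not variables used later in the notebook):
# Illustrative only: a random square system A w = b
A = np.random.randn(3, 3)
b = np.random.randn(3, 1)
# Discouraged: explicit matrix inversion
w_inv = np.linalg.inv(A).dot(b)
# Preferred: least-squares solver
w_lstsq, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_inv, w_lstsq))   # same solution, but lstsq is numerically more robust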
Sometimes, the computation of $\log|{\mathbf A}|$ (where ${\mathbf A}$ is a positive definite matrix) can overflow available precision, producing incorrect results. A numerically more stable alternative, providing the same result, is $2\sum_i \log([{\mathbf L}]_{ii})$, where $\mathbf L$ is the Cholesky decomposition of $\mathbf A$ (i.e., ${\mathbf A} = {\mathbf L} {\mathbf L}^\top$), and $[{\mathbf L}]_{ii}$ is the $i$th element of the diagonal of ${\mathbf L}$.
Non-degenerate covariance matrices, such as the ones in this exercise, are always positive definite. It may happen, as a consequence of chained rounding errors, that a matrix which was mathematically expected to be positive definite turns out not to be so, in which case its Cholesky decomposition will not be available. A quick way to mitigate this problem is to add a small number (such as $10^{-6}$) to the diagonal of the matrix.
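A minimal sketch of both tricks (the matrix A below is an arbitrary positive definite example built for illustration):
# Arbitrary positive definite matrix for illustration
B = np.random.randn(4, 4)
A = B.dot(B.T) + 1e-6 * np.eye(4)   # small jitter on the diagonal guards against rounding issues
# Numerically stable log-determinant via the Cholesky factor
L_chol = np.linalg.cholesky(A)
logdet_A = 2 * np.sum(np.log(np.diag(L_chol)))
# Same value, computed through the determinant (which may overflow for larger matrices)
print(logdet_A, np.log(np.linalg.det(A)))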
In [2]:
np.random.seed(3)
First, we are going to generate synthetic data (so that we have the ground-truth model) and use them to make sure everything works correctly and our estimations are sensible.
Set the prior and noise distribution parameters to $\sigma_p^2 = 2$ and $\sigma_\varepsilon^2 = 0.2$, defining variables sigma_p and sigma_eps containing the respective standard deviations.
In [3]:
# Parameter settings
# sigma_p = <FILL IN>
sigma_p = np.sqrt(2)
# sigma_eps = <FILL IN>
sigma_eps = np.sqrt(0.2)
In [4]:
# Data dimension:
dim_x = 2
# Generate a parameter vector taking a random sample from the prior distributions
# (the np.random module may be useful for this purpose)
# true_w = <FILL IN>
true_w = sigma_p * np.random.randn(dim_x, 1)
print('The true parameter vector is:')
print(true_w)
Generate a uniform grid of 21 input values between 0 and 2, stored as a column vector X (the function linspace from numpy can be useful for this).
In [5]:
# <SOL>
# Parameter settings
x_min = 0
x_max = 2
n_points = 21
# Training datapoints
X = np.linspace(x_min, x_max, n_points)[:,np.newaxis]
# </SOL>
In [6]:
# Expand input matrix with an all-ones column
col_1 = np.ones((n_points, 1))
# Z = <FILL IN>
Z = np.hstack((col_1,X))
# Generate values of the target variable
# s = <FILL IN>
s = np.dot(Z, true_w) + sigma_eps * np.random.randn(n_points, 1)
In [7]:
# <SOL>
# Plot training points
plt.scatter(X, s);
plt.xlabel('$x$',fontsize=14);
plt.ylabel('$s$',fontsize=14);
# </SOL>
Define a function predict(Z, we) that computes the linear predictions for all inputs in the data matrix Z (a 2D numpy array), for a given parameter vector we (a 1D numpy array). The output should be a 1D array. Test your function with the given dataset and we = [0.4, 0.7].
In [8]:
# <SOL>
# Prediction function
def predict(Z, w):
return Z.dot(w)
we = np.array([0.4, 0.7])
p = predict(Z, we)
# </SOL>
# Print predictions
print(p)
Define a function sse(Z, s, we) that computes the sum of squared errors (SSE) of the linear prediction with parameters we (1D numpy array), inputs Z (2D numpy array) and targets s (1D numpy array). Using this function, compute the SSE of the true parameter vector in true_w.
In [9]:
# <SOL>
# Sum of Squared Errors
def sse(Z, s, w):
return np.sum((s - predict(Z, w))**2)
SSE = sse(Z, s, true_w)
# </SOL>
print(" The SSE is: {0}".format(SSE))
Define a function likelihood(we, Z, s, sigma_eps) that computes the likelihood of parameter vector we for a given dataset in matrix Z and vector s, assuming Gaussian noise with variance $\sigma_\varepsilon^2$. Note that this function can use the sse function defined above. Using this function, compute the likelihood of the true parameter vector in true_w.
In [10]:
# <SOL>
# Likelihood of the parameter vector w for the dataset (Z, s)
def likelihood(w, Z, s, sigma_eps):
K = len(s)
lw = 1.0 / (np.sqrt(2*np.pi)*sigma_eps)**K * np.exp(- sse(Z, s, w)/(2*sigma_eps**2))
return lw
L_w_true = likelihood(true_w, Z, s, sigma_eps)
# </SOL>
print("The likelihood of the true parameter vector is {0}".format(L_w_true))
[4] Define a function LL(we, Z, s, sigma_eps) that computes the log-likelihood of parameter vector we for a given dataset in matrix Z and vector s. Note that this function can use the likelihood function defined above. However, for higher numerical precision, implementing a direct expression for the log-likelihood is recommended.
Using this function, compute the log-likelihood of the true parameter vector in true_w.
In [11]:
# <SOL>
# Log-likelihood of the parameter vector w for the dataset (Z, s)
def LL(w, Z, s, sigma_eps):
K = len(s)
Lw = - K * np.log(np.sqrt(2*np.pi)*sigma_eps) - sse(Z, s, w)/(2*sigma_eps**2)
return Lw
LL_w_true = LL(true_w, Z, s, sigma_eps)
# </SOL>
print("The log-likelihood of the true parameter vector is {0}".format(LL_w_true))
In [12]:
# <SOL>
w_ML, _, _, _ = np.linalg.lstsq(Z, s, rcond=None)
# </SOL>
print(w_ML)
In [13]:
# <SOL>
L_w_ML = likelihood(w_ML, Z, s, sigma_eps)
LL_w_ML = LL(w_ML, Z, s, sigma_eps)
# </SOL>
print('Maximum likelihood: {0}'.format(L_w_ML))
print('Maximum log-likelihood: {0}'.format(LL_w_ML))
Just as an illustration, the code below generates a set of points in a two-dimensional grid going from $(-2.5\sigma_p, -2.5\sigma_p)$ to $(2.5\sigma_p, 2.5\sigma_p)$, computes the log-likelihood at all these points and visualizes it with a contour plot. You can see the difference between the true value of the parameter ${\bf w}$ (black) and the ML estimate (red).
In [14]:
# First construct a grid of (theta0, theta1) parameter pairs and their
# corresponding cost function values.
N = 200 # Number of points along each dimension.
w0_grid = np.linspace(-2.5*sigma_p, 2.5*sigma_p, N)
w1_grid = np.linspace(-2.5*sigma_p, 2.5*sigma_p, N)
Lw = np.zeros((N,N))
# Fill Lw with the log-likelihood values
for i, w0i in enumerate(w0_grid):
for j, w1j in enumerate(w1_grid):
we = np.array([[w0i], [w1j]])  # column vector, so that shapes match those of s inside sse()
Lw[i, j] = LL(we, Z, s, sigma_eps)
WW0, WW1 = np.meshgrid(w0_grid, w1_grid, indexing='ij')
contours = plt.contour(WW0, WW1, Lw, 20)
plt.clabel(contours)
plt.scatter([true_w[0]]*2, [true_w[1]]*2, s=[50,10], color=['k','w'])
plt.scatter([w_ML[0]]*2, [w_ML[1]]*2, s=[50,10], color=['r','w'])
plt.xlabel('$w_0$')
plt.ylabel('$w_1$')
plt.show()
Note that the likelihood of the true parameter vector is, in general, smaller than that of the ML estimate. However, as the sample size increases, both should converge to the same value.
Generate a much larger dataset (up to $2^{16}$ samples), storing the inputs in the 1D array X2 and the corresponding targets in the 1D array s2. Then compute the ML estimate for subsets of increasing size and plot the squared error with respect to true_w as the sample size grows.
In [ ]:
# Parameter settings
x_min = 0
x_max = 2
n_points = 2**16
# <SOL>
# Training datapoints
X2 = np.linspace(x_min, x_max, n_points)
# Expand input matrix with an all-ones column
col_1 = np.ones((n_points,))
Xe2 = np.vstack((col_1, X2)).T
s2 = Xe2.dot(true_w).ravel() + sigma_eps * np.random.randn(n_points)  # ravel() keeps s2 one-dimensional
# </SOL>
In [ ]:
# <SOL>
e2 = []
for k in range(3, 16):
Xk = Xe2[0:2**k, :]
sk = s2[0:2**k]
w_MLk, _, _, _ = np.linalg.lstsq(Xk, sk, rcond=None)
e2.append(np.sum((true_w.ravel() - w_MLk)**2))  # compare 1D vectors to avoid unwanted broadcasting
plt.semilogy(e2)
plt.show()
# </SOL>
In [ ]:
# <SOL>
matvar = scipy.io.loadmat('DatosLabReg.mat')
Xtrain = matvar['Xtrain']
Xtest = matvar['Xtest']
Ytrain = matvar['Ytrain']
Ytest = matvar['Ytest']
# </SOL>
Normalize the input variables so that, on the training set, each one has zero mean and unit standard deviation, and apply the same transformation to the test data. Store the normalized data back in Xtrain and Xtest, respectively.
In [ ]:
# <SOL>
# Data normalization
mean_x = np.mean(Xtrain,axis=0)
std_x = np.std(Xtrain,axis=0)
Xtrain = (Xtrain - mean_x) / std_x
Xtest = (Xtest - mean_x) / std_x
# </SOL>
Extract the first input variable (first column) of Xtrain and Xtest into arrays X0train and X0test, respectively.
In [ ]:
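A possible solution, assuming (as the statement above suggests) that X0train and X0test should contain just the first input variable:
# Assumption: keep only the first column of the normalized input matrices
X0train = Xtrain[:, [0]]   # shape (n_train, 1)
X0test = Xtest[:, [0]]     # shape (n_test, 1)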
In [ ]:
[3] Since the data have been taken from a real scenario, we do not have any true mathematical model of the process that generated them. Thus, we will explore different models, trying to find the one that best fits the training data.
Assume a polynomial model given by $$ {\bf z} = T({\bf x}) = (1, x_0, x_0^2, \ldots, x_0^{g-1})^\top. $$
Build a method
Ztrain, Ztest = T_poly(Xtrain, Xtest, g)
that, for a given value of $g$, computes normalized data matrices Ztrain and Ztest resulting from applying the polynomial transformation to the inputs in X0train and X0test.
Note that, even though X0train and X0test were already normalized, you will need to re-normalize the transformed variables.
In [ ]:
# <SOL>
# Extend input data matrices with a column of 1's
col_1 = np.ones((Xtrain.shape[0],1))
Ztrain = np.concatenate((col_1, Xtrain), axis = 1)
col_1 = np.ones((Xtest.shape[0],1))
Ztest = np.concatenate((col_1, Xtest), axis = 1)
# </SOL>
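The solution cell above only extends the inputs with an all-ones column. A fuller sketch of T_poly could look like the code below; it assumes the polynomial columns are built from the first input variable, re-normalized with the training statistics, and that the powers follow the transformation $T({\bf x})$ above literally (i.e., up to $x_0^{g-1}$). The names Ztrain3 and Ztest3 in the usage example are hypothetical.
def T_poly(Xtr, Xtst, g):
    # Use the first input variable only, as in the rest of this section
    x_tr = Xtr[:, [0]]
    x_tst = Xtst[:, [0]]
    # Powers x_0^1, ..., x_0^{g-1}, following the transformation T(x) defined above
    exps = np.arange(1, g)
    Ptr = x_tr ** exps
    Ptst = x_tst ** exps
    # Re-normalize the transformed variables using the training statistics
    mean_p = np.mean(Ptr, axis=0)
    std_p = np.std(Ptr, axis=0)
    Ptr = (Ptr - mean_p) / std_p
    Ptst = (Ptst - mean_p) / std_p
    # Prepend the all-ones column
    Ztr = np.hstack((np.ones((Ptr.shape[0], 1)), Ptr))
    Ztst = np.hstack((np.ones((Ptst.shape[0], 1)), Ptst))
    return Ztr, Ztst

# Example usage (hypothetical names): polynomial matrices for g = 3
Ztrain3, Ztest3 = T_poly(Xtrain, Xtest, 3)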
[4] Fit a polynomial model with degree $g$ for $g$ ranging from 0 to 10. Store the weights of all models in a list of weight vectors, named models, such that models[g] returns the parameters estimated for the polynomial model with degree $g$.
We will use these models in the following sections.
In [ ]:
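A minimal sketch, assuming the T_poly function sketched above and the least-squares (ML) fit used earlier in the notebook:
# Fit one polynomial model per value of g and store the weight vectors
models = []
for g in range(11):
    Zg_train, _ = T_poly(Xtrain, Xtest, g)
    # ML (least squares) estimate of the coefficients for this g
    w_g, _, _, _ = np.linalg.lstsq(Zg_train, Ytrain, rcond=None)
    models.append(w_g)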
[6] Show, in the same plot:
In [ ]:
[7] [OPTIONAL] You may have seen that the likelihood function grows with the degree of the polynomial. However, large values of $g$ produce strong overfitting of the training data. For this reason, $g$ cannot be selected with the same data used to fit the model.
Parameters like $g$ are usually called hyperparameters, and need to be selected by cross validation.
Another hyperparameter is $\sigma_\varepsilon^2$. Plot the log-likelihood function corresponding to the polynomial model with degree 3 for different values of $\sigma_\varepsilon^2$, for the training set and the test set. What would be the optimal value of this hyperparameter according to the training set?
In any case, note that the model coefficients do not depend on $\sigma_\varepsilon^2$. Therefore, we do not need to estimate its value for ML regression.
In [ ]:
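A possible sketch for this optional exercise, assuming the T_poly and models objects from the sketches above (Ztrain3, Ztest3 and the grid of candidate variances are made-up names and values for illustration). Since the coefficients are fixed, the log-likelihood depends on $\sigma_\varepsilon^2$ only through the SSE:
# Transformed matrices and fitted coefficients for g = 3
Ztrain3, Ztest3 = T_poly(Xtrain, Xtest, 3)
w3 = models[3]
K_train = Ztrain3.shape[0]
K_test = Ztest3.shape[0]
sse_train = np.sum((Ytrain - Ztrain3.dot(w3))**2)
sse_test = np.sum((Ytest - Ztest3.dot(w3))**2)
# Evaluate the log-likelihood on a grid of candidate noise variances
sigma2_grid = np.linspace(0.01, 2, 200)
LL_train = -0.5 * K_train * np.log(2 * np.pi * sigma2_grid) - sse_train / (2 * sigma2_grid)
LL_test = -0.5 * K_test * np.log(2 * np.pi * sigma2_grid) - sse_test / (2 * sigma2_grid)
plt.plot(sigma2_grid, LL_train, label='Training set')
plt.plot(sigma2_grid, LL_test, label='Test set')
plt.xlabel(r'$\sigma_\varepsilon^2$')
plt.ylabel('Log-likelihood')
plt.legend(loc='best')
plt.show()
# The maximizer of the training log-likelihood is the usual ML noise variance estimate, SSE/K
print('Optimal sigma_eps^2 (training set): {0}'.format(sse_train / K_train))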
In [ ]:
[9] For the selected model:
In [ ]:
In this section we will keep using the first component of the data from the stock dataset, assuming the same kind of polynomial model. We will explore the potential advantages of using a Bayesian model. To do so, we will assume that the a priori distribution of ${\bf w}$ is
$${\bf w} \sim {\cal N}({\bf 0}, \sigma_p^2~{\bf I}).$$
Define a function posterior_stats(Z, s, sigma_eps, sigma_p) that computes the parameters of the posterior distribution of the coefficients given the dataset in matrix Z and vector s, for given values of the hyperparameters.
This function should return the posterior mean, the covariance matrix and the precision matrix (the inverse of the covariance matrix). Test the function on the given dataset, for $g=3$.
In [ ]:
# <SOL>
def posterior_stats(Z, s, sigma_eps, sigma_p):
dim_w = Z.shape[1]
iCov_w = Z.T.dot(Z)/(sigma_eps**2) + np.eye(dim_w, dim_w)/(sigma_p**2)
Cov_w = np.linalg.inv(iCov_w)
mean_w = Cov_w.dot(Z.T).dot(s)/(sigma_eps**2)
return mean_w, Cov_w, iCov_w
# </SOL>
mean_w, Cov_w, iCov_w = posterior_stats(Z, s, sigma_eps, sigma_p)
print('true_w = {0}'.format(true_w))
print('mean_w = {0}'.format(mean_w))
print('Cov_w = {0}'.format(Cov_w))
print('iCov_w = {0}'.format(iCov_w))
Define a function gauss_pdf(we, mean_w, iCov_w) that computes the Gaussian pdf with mean mean_w and precision matrix iCov_w. Use this function to compute and compare the posterior pdf value of the true coefficients, the ML estimate and the MSE estimate, given the dataset.
In [ ]:
# <SOL>
def gauss_pdf(w, mean_w, iCov_w):
d = w - mean_w
w_dim = len(mean_w)
pw = np.sqrt(np.linalg.det(iCov_w)) / (2*np.pi)**(w_dim/2) * np.exp(- d.T.dot(iCov_w.dot(d))/2)
return pw
# </SOL>
print('p(true_w | s) = {0}'.format(gauss_pdf(true_w, mean_w, iCov_w)))
print('p(w_ML | s) = {0}'.format(gauss_pdf(w_ML, mean_w, iCov_w)))
print('p(w_MSE | s) = {0}'.format(gauss_pdf(mean_w, mean_w, iCov_w)))
Define a function log_gauss_pdf(we, mean_w, iCov_w) that computes the log of the Gaussian pdf with mean mean_w and precision matrix iCov_w. Use this function to compute and compare the log of the posterior pdf value of the true coefficients, the ML estimate and the MSE estimate, given the dataset.
In [ ]:
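A minimal sketch of log_gauss_pdf, using the Cholesky-based log-determinant trick suggested at the top of the notebook (since the function receives the precision matrix, its log-determinant enters with a positive sign):
def log_gauss_pdf(w, mean_w, iCov_w):
    d = w - mean_w
    w_dim = len(mean_w)
    # Stable log-determinant of the precision matrix via its Cholesky factor
    L_prec = np.linalg.cholesky(iCov_w)
    logdet_iCov = 2 * np.sum(np.log(np.diag(L_prec)))
    return 0.5 * logdet_iCov - 0.5 * w_dim * np.log(2 * np.pi) - 0.5 * d.T.dot(iCov_w.dot(d))

print('log p(true_w | s) = {0}'.format(log_gauss_pdf(true_w, mean_w, iCov_w)))
print('log p(w_ML | s) = {0}'.format(log_gauss_pdf(w_ML, mean_w, iCov_w)))
print('log p(w_MSE | s) = {0}'.format(log_gauss_pdf(mean_w, mean_w, iCov_w)))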
Since the values $\sigma_p$ and $\sigma_\varepsilon$ are no longer known, we have to select them in some way (we will soon see how to estimate them in a principled way). To see their influence, you can assume $g=3$ and plot the regression function for different values of $\sigma_p$.
For now, we will adjust them using the LS solution to the regression problem:
In [ ]:
# w_LS, residuals, rank, sv = <FILL IN>
w_LS, residuals, rank, sv = np.linalg.lstsq(Ztrain, Ytrain, rcond=None)  # sv holds the singular values; do not overwrite s
# sigma_p = <FILL IN>
sigma_p = np.sqrt(np.mean(w_LS**2))
# sigma_eps = <FILL IN>
sigma_eps = np.sqrt(2*np.mean((Ytrain - Ztrain.dot(w_LS))**2))
print(sigma_eps)
In this section we will plot the functions corresponding to different samples drawn from the posterior distribution of the weight vector.
To this end, we will first generate an input dataset of equally spaced samples. We will compute the functions at these points.
In [ ]:
# Definition of the interval for representation purposes
x2_min = -1
x2_max = 3
n_points = 100  # Number of points where the regression functions will be evaluated
# Build the input data matrix:
# Input values for representation of the regression curves
X2 = np.linspace(x2_min, x2_max, n_points)
col_1 = np.ones((n_points,))
X2e = np.vstack((col_1, X2)).T
Generate random vectors ${\bf w}_l$ with $l = 1,\dots, 50$, from the posterior density of the weights, $p({\bf w}\mid{\bf s})$, and use them to generate 50 regression functions, $f({\bf x}^\ast) = {{\bf z}^\ast}^\top {\bf w}_l$, evaluated at the grid of input values generated in the previous cell.
Plot the line corresponding to the model with the posterior mean parameters, along with the $50$ generated straight lines and the original samples, all in the same plot. As you can check, the Bayesian model is not providing a single answer, but instead a density over them, from which we have extracted 50 options.
In [ ]:
# Drawing weights from the posterior
# First, compute the cholesky decomposition of the covariance matrix
# L = <FILL IN>
L = np.linalg.cholesky(Cov_w)
for l in range(50):
# Generate a random sample from the posterior distribution
# w_l = <FILL IN>
w_l = L.dot(np.random.randn(dim_x, 1)) + mean_w  # column-shaped sample, consistent with mean_w
# Compute predictions for the inputs in the data matrix
# p_l = <FILL IN>
p_l = X2e.dot(w_l)
# Plot prediction function
# plt.plot(<FILL IN>, 'c:');
plt.plot(X2, p_l, 'c:');
# Compute predictions for the inputs in the data matrix and using the true model
# p_truew = <FILL IN>
p_truew = X2e.dot(true_w)
# Plot the true model
plt.plot(X2, p_truew, 'b', label='True model', linewidth=2);
# Plot the training points
plt.plot(X,s,'r.',markersize=12);
plt.xlim((x2_min,x2_max));
plt.legend(loc='best')
plt.xlabel('$x$',fontsize=14);
plt.ylabel('$s$',fontsize=14);
On top of the previous figure (copy here your code from the previous section), plot functions
$${\mathbb E}\left\{f({\bf x}^\ast)\mid{\bf s}\right\}$$
and
$${\mathbb E}\left\{f({\bf x}^\ast)\mid{\bf s}\right\} \pm 2 \sqrt{{\mathbb V}\left\{f({\bf x}^\ast)\mid{\bf s}\right\}}$$
(i.e., the posterior mean of $f({\bf x}^\ast)$, as well as two standard deviations above and below).
It is possible to show analytically that this region comprises $95.45\%$ probability of the posterior probability $p(f({\bf x}^\ast)\mid {\bf s})$ at each ${\bf x}^\ast$.
In [ ]:
# Note that you can re-use code from sect. 4.2 to solve this exercise
# Plot sample functions from the posterior, and the training points
# <SOL>
# Drawing weights from the posterior
for l in range(50):
# Generate a random sample from the posterior distribution
# w_l = <FILL IN>
w_l = L.dot(np.random.randn(dim_x, 1)) + mean_w
# Compute predictions for the inputs in the data matrix
# p_l = <FILL IN>
p_l = X2e.dot(w_l)
# Plot prediction function
# plt.plot(<FILL IN>, 'c:');
plt.plot(X2, p_l, 'c:');
# Plot as well the training points and the true model
plt.plot(X2, X2e.dot(true_w), 'b', label='True model', linewidth=2);
plt.plot(X,s,'r.',markersize=12);
plt.xlim((x2_min,x2_max));
# </SOL>
# Plot the posterior mean.
# mean_ast = <FILL IN>
mean_ast = X2e.dot(mean_w).flatten()  # 1D, so that it combines correctly with std_ast below
plt.plot(X2, mean_ast, 'm', label='Predictive mean', linewidth=2);
# Plot the posterior mean \pm 2 std
# std_ast = <FILL IN>
std_ast = np.sqrt(np.diagonal(X2e.dot(Cov_w).dot(X2e.T)))
# plt.plot(<FILL IN>, 'm--', label='Predictive mean $\pm$ 2std', linewidth=2);
plt.plot(X2, mean_ast+2*std_ast, 'm--', label='Predictive mean $\pm$ 2std', linewidth=2);
# plt.plot(<FILL IN>, 'm--', linewidth=3);
plt.plot(X2, mean_ast-2*std_ast, 'm--', linewidth=3);
plt.legend(loc='best')
plt.xlabel('$x$',fontsize=14);
plt.ylabel('$s$',fontsize=14);
Plot now ${\mathbb E}\left\{s({\bf x}^\ast)\mid{\bf s}\right\} \pm 2 \sqrt{{\mathbb V}\left\{s({\bf x}^\ast)\mid{\bf s}\right\}}$ (note that the posterior means of $f({\bf x}^\ast)$ and $s({\bf x}^\ast)$ are the same, so there is no need to plot it again). Notice that approximately $95.45\%$ of the observed data should now lie within the newly designated region. These new limits establish a confidence range for our predictions. See how the uncertainty grows as we move away from the interpolation region into the extrapolation areas.
In [ ]:
# Plot sample functions confidence intervals and sampling points
# Note that you can simply copy and paste most of the code used in the cell above.
# <SOL>
# Plot sample functions from the posterior, and the training points
# Drawing weights from the posterior
for l in range(50):
# Generate a random sample from the posterior distribution
w_l = L.dot(np.random.randn(dim_x, 1)) + mean_w
# Compute predictions for the inputs in the data matrix
p_l = X2e.dot(w_l)
# Plot prediction function
plt.plot(X2, p_l, 'c:');
# Plot as well the training points and the true model
plt.plot(X2, X2e.dot(true_w), 'b', label='True model', linewidth=2);
plt.plot(X,s,'r.',markersize=12);
plt.xlim((x2_min,x2_max));
# Plot the posterior mean.
mean_ast = X2e.dot(mean_w).flatten()
plt.plot(X2, mean_ast, 'm', label='Predictive mean', linewidth=2);
# Plot the posterior mean \pm 2 std
std_ast = np.sqrt(np.diagonal(X2e.dot(Cov_w).dot(X2e.T)))
plt.plot(X2, mean_ast + 2*std_ast, 'm--', label='Predictive mean $\pm$ 2std', linewidth=2);
plt.plot(X2, mean_ast - 2*std_ast, 'm--', linewidth=3);
# </SOL>
# Compute the standard deviations for s and plot the confidence intervals
# <SOL>
std_ast_eps = np.sqrt(np.diagonal(X2e.dot(Cov_w).dot(X2e.T)) + sigma_eps**2)
#Plot now the posterior mean and posterior mean \pm 2 std for s (i.e., adding the noise variance)
plt.plot(X2, mean_ast + 2*std_ast_eps, 'm:', label='Predictive mean of s $\pm$ 2std', linewidth=2);
plt.plot(X2, mean_ast - 2*std_ast_eps, 'm:', linewidth=2);
# </SOL>
plt.legend(loc='best')
plt.xlabel('$x$',fontsize=14);
plt.ylabel('$s$',fontsize=14);
[OPTIONAL. You can skip this section]
In order to verify the performance of the resulting model, compute the posterior mean and variance of each of the test outputs from the posterior over ${\bf w}$. That is, compute ${\mathbb E}\left\{s({\bf x}^\ast)\mid{\bf s}\right\}$ and $\sqrt{{\mathbb V}\left\{s({\bf x}^\ast)\mid{\bf s}\right\}}$ for each test sample ${\bf x}^\ast$ contained in each row of Xtest. Be sure not to use the outputs Ytest at any point during this process.
Store the predictive mean and variance of all test samples in two column vectors called m_y and v_y, respectively.
In [ ]:
# <SOL>
m_y = Ztest.dot(mean_w)
v_y = np.diagonal(Ztest.dot(Cov_w).dot(Ztest.T)) + sigma_eps**2
v_y = v_y[:, np.newaxis]  # column vector (avoids the deprecated np.matrix)
# </SOL>
Compute now the mean square error (MSE) and the negative log-predictive density (NLPD) with the following code:
In [ ]:
# <SOL>
MSE = np.mean((m_y - Ytest)**2)
NLPD = np.mean(0.5*((Ytest - m_y)**2)/v_y + 0.5*np.log(2*np.pi*v_y))
# </SOL>
Results should be:
In [ ]:
print('MSE = {0}'.format(MSE))
print('NLPD = {0}'.format(NLPD))
These two measures reveal the quality of our predictor (lower values indicate higher quality). The first measure (MSE) only compares the predictive mean with the actual value; it is always non-negative (a value of zero would mean a perfect prediction) and it ignores the predictive variance. The second measure (NLPD) takes into account both the deviation and the predictive variance (uncertainty) when assessing the quality of the probabilistic prediction: a large error in a prediction that was already known to have high variance receives a smaller penalty, but announcing a high variance when the prediction error is small is also penalized.
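For reference, the NLPD computed above averages the negative log of the Gaussian predictive density over the $K$ test samples,
$$\text{NLPD} = \frac{1}{K}\sum_{k=1}^{K}\left[\frac{1}{2}\log\left(2\pi v_k\right) + \frac{\left(y_k - m_k\right)^2}{2 v_k}\right],$$
where $m_k$ and $v_k$ are the predictive mean and variance for the $k$th test sample.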