In [1]:
# Import some libraries that will be necessary for working with data and displaying plots
# To visualize plots in the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io # To read matlab files
import pylab
In the previous session we tackled the problem of fitting the following model using a LS criterion:
$${\hat s}({\bf x}) = f({\bf x}) = {\bf w}^\top {\bf z}$$where ${\bf z}$ is a vector whose components can be computed directly from the observed variables. Such a model includes the linear regression problem, where ${\bf z} = [1; {\bf x}]$, as well as any other non-linear model, as long as it can be expressed as a "linear in the parameters" model.
The LS solution was defined as the one minimizing the sum of squared residuals over the training set $\{{\bf x}^{(k)}, s^{(k)}\}_{k=1}^K$. As a result, a single parameter vector ${\bf w}^*$ was obtained, and correspondingly a single regression curve.
In this session, rather than trying to obtain the best single model, we will work with a family of models or functions, and model the problem probabilistically, so that we can assign a probability value to each of the possible functions.
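As a quick reminder of what "linear in the parameters" means, the following minimal sketch (with arbitrary, purely illustrative weights) builds the extended vector ${\bf z}$ for the simple linear case, ${\bf z} = [1; x]$, and for a polynomial model; in both cases the prediction is just the inner product ${\bf w}^\top {\bf z}$.
In [ ]:
# A minimal sketch of the "linear in the parameters" idea (illustrative values only;
# it relies on the numpy import from the first cell).
x = 1.7                                      # a hypothetical scalar observation
# Linear model: z = [1; x]
z_lin = np.array([1.0, x])
w_lin = np.array([0.5, -1.2])                # arbitrary weights, for illustration
s_hat_lin = w_lin.dot(z_lin)                 # prediction = w^T z
# Cubic polynomial model: z = [1, x, x^2, x^3]; still linear in the parameters w
z_poly = np.array([x**k for k in range(4)])
w_poly = np.array([0.1, -0.4, 0.2, 0.05])    # arbitrary weights, for illustration
s_hat_poly = w_poly.dot(z_poly)
print(s_hat_lin, s_hat_poly)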
Consider the same regression task of the previous session. We have a training dataset consisting of 15 points, and we depict the regression curves that would be obtained if an additional point were added at a fixed location, for different target values of that point:
(You can run this code fragment several times to check how the regression curves change between executions, depending also on the location of the training points.)
In [42]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
n_val_16 = 5
degree = 12
X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model
X_16 = .3 * np.ones((n_val_16,))
S_16 = np.linspace(np.min(S_tr),np.max(S_tr),n_val_16)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)
ax.plot(X_16,S_16,'ro',markersize=6)
ax.plot(X_grid,S_grid,'r-',label='True model')
for el in zip(X_16,S_16):
    #Add point to the training set
    X_tr_iter = np.append(X_tr,el[0])
    S_tr_iter = np.append(S_tr,el[1])
    #Obtain LS regression coefficients and evaluate it at X_grid
    w_LS = np.polyfit(X_tr_iter, S_tr_iter, degree)
    S_grid_iter = np.polyval(w_LS,X_grid)
    ax.plot(X_grid,S_grid_iter,'g-')
ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
ax.legend(loc='best')
Out[42]:
The key missing ingredient is assumptions!
Do we think that all models are equally probable... before we see any data? What does the term model probability mean?
Do we need to choose a single "best" model or can we consider several simultaneously?
Perhaps our training targets are contaminated with noise. What to do?
We will start by postulating a generative model for the training data that includes the presence of noise contaminating the targets, and work on this model to partly answer the other two questions.
Denoting by $f({\bf x}) = {{\bf w}}^\top {\bf z}$ the true function that we would like to obtain, we could assume that the observations in the training set are obtained as noisy values of the output of such function, i.e.,
$$s^{(k)} = f({\bf x}^{(k)}) + \varepsilon^{(k)}$$We will further characterize the noise values as i.i.d. and normally distributed, with mean zero, and variance $\sigma_\varepsilon^2$, i.e.,
$$\varepsilon \sim {\cal N}\left(0, \sigma_\varepsilon^2\right)$$Joint distribution of the noise samples, ${\pmb \varepsilon} = \left[\varepsilon^{(1)}, \dots, \varepsilon^{(K)}\right]^\top$:
$${\pmb \varepsilon} \sim {\cal N}\left( {\bf 0}, \sigma_{\varepsilon}^2 {\bf I}\right), \qquad p({\pmb \varepsilon}) = \left(\frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}}\right)^K \exp\left(- \frac{{\pmb \varepsilon}^\top {\pmb \varepsilon}}{2 \sigma_{\varepsilon}^2}\right)$$
Denoting ${\bf s} = \left[s^{(1)}, \dots, s^{(K)} \right]^\top$ and ${\bf f} = \left[ f({\bf x}^{(1)}), \dots,f({\bf x}^{(K)})\right]^\top$, we have
$${\bf s} = {\bf f} + {\pmb \varepsilon}$$
Conditioning on the values of the target function, ${\bf f}$, the pdf of the available targets is obtained as a shifted version of the distribution of the noise. More precisely:
\begin{align}p({\bf s}|{\bf f}) & = \left(\frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}}\right)^K \exp\left(- \frac{\|{\bf s} - {\bf f}\|^2}{2 \sigma_{\varepsilon}^2}\right) \end{align}
For the particular parametric selection of $f({\bf x})$, ${\bf f} = {\bf Z} {\bf w}$, conditioning on ${\bf f}$ is equivalent to conditioning on ${\bf w}$, so that:
$$p({\bf s}|{\bf f}) = p({\bf s}|{\bf w}) = \left(\frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}}\right)^K \exp\left(- \frac{\|{\bf s} - {\bf Z}{\bf w}\|^2}{2 \sigma_{\varepsilon}^2}\right)$$
The maximum likelihood solution is then given by:
$${\bf w}_{ML} = \arg \max_{\bf w} p({\bf s}|{\bf w}) = \arg \min_{\bf w} \|{\bf s} - {\bf Z}{\bf w}\|^2$$With an additive Gaussian independent noise model, the maximum likelihood and the least squares solutions are the same. We have not improved much ...
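As a quick numerical check of this equivalence, the following sketch solves the least squares problem on the data matrix ${\bf Z}$ directly with a linear solver and compares the result with np.polyfit; it assumes X_tr and S_tr from the code cell above are still in memory, and uses a small polynomial degree so that the comparison is numerically well conditioned.
In [ ]:
# Sketch: the ML solution under i.i.d. Gaussian noise minimizes ||s - Z w||^2,
# so it can be obtained with an ordinary least squares solver.
# (Assumes X_tr and S_tr from the cell above; small degree for numerical stability.)
deg_check = 3
Z_check = np.asarray([[x[0]**k for k in range(deg_check+1)] for x in X_tr.tolist()])
w_ML, _, _, _ = np.linalg.lstsq(Z_check, S_tr, rcond=None)
# polyfit solves the same LS problem (coefficients returned highest degree first)
w_LS_check = np.polyfit(X_tr.flatten(), S_tr.flatten(), deg_check)
print(np.allclose(w_ML.flatten(), w_LS_check[::-1]))   # should print True (up to precision)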
However, we have already formulated the problem in a probabilistic way. This opens the door to reasoning in terms of a set of possible explanations, not just one, since more than one of our models could plausibly have generated the data.
If we express our a priori belief of models using a prior distribution $p({\bf f})$, then we can infer the a posteriori distribution using Bayes' rule:
$$p({\bf f}|{\bf s}) = \frac{p({\bf s}|{\bf f})~p({\bf f})}{p({\bf s})}$$
In the previous expression, $p({\bf s}|{\bf f})$ is the likelihood of the observed targets for the model ${\bf f}$, $p({\bf f})$ is the prior distribution over models (expressing our belief before seeing any data), and $p({\bf s})$ is a normalization term that does not depend on ${\bf f}$.
For the parametric model ${\bf f} = {\bf Z} {\bf w}$, the previous expressions become:
$$p({\bf w}|{\bf s}) = \frac{p({\bf s}|{\bf w})~p({\bf w})}{p({\bf s})}$$
where $p({\bf s}|{\bf w})$ is the likelihood obtained in the previous section, $p({\bf w})$ is the prior distribution over the weight vector, and $p({\bf s}) = \int p({\bf s}|{\bf w})~p({\bf w})~d{\bf w}$ is a normalization constant.
Following a ML approach, we retain a single model, ${\bf w}_{ML} = \arg \max_{\bf w} p({\bf s}|{\bf w})$. Then, the predictive distribution of the target value for a new point would be obtained as:
$$p({s^*}|{\bf w}_{ML},{\bf x}^*) $$
For the generative model of Section 3.1.2 (additive i.i.d. Gaussian noise), this distribution is:
$$p({s^*}|{\bf w}_{ML},{\bf x}^*) = \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp \left(-\frac{\left(s^* - {\bf w}_{ML}^\top {\bf z}^*\right)^2}{2 \sigma_\varepsilon^2} \right)$$
The mean of $s^*$ is just the prediction of the LS model, and the same uncertainty (the noise variance of the model) is assumed regardless of the observation vector.
If a single value is to be kept, we would probably keep the mean of the distribution, which is equivalent to the LS prediction.
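As an illustration (a sketch only, reusing X_tr, S_tr, degree and std_n from the code cell above, and treating the noise standard deviation as known), the ML predictive density at a new point is a Gaussian centred at ${\bf w}_{ML}^\top {\bf z}^*$ whose width is the same for every ${\bf x}^*$:
In [ ]:
# Sketch of the ML predictive distribution at a single, hypothetical test input.
# Reuses X_tr, S_tr, degree and std_n from the cell above; the noise std is assumed known.
x_star = 1.0                                             # hypothetical new input
z_star = np.array([x_star**k for k in range(degree+1)])  # feature vector z*
w_LS = np.polyfit(X_tr.flatten(), S_tr.flatten(), degree)   # equals w_ML under Gaussian noise
s_mean = z_star.dot(w_LS[::-1])   # predictive mean, w_ML^T z* (polyfit stores coefficients
                                  # with the highest degree first, hence the reversal)
s_std = std_n                     # predictive std: the noise std, identical for every x*
print('ML predictive mean: %.3f, std: %.3f' % (s_mean, s_std))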
Using Bayesian inference, we retain all models. Then, the inference of the value $s^* = s({\bf x}^*)$ is carried out by mixing all models, according to the weights given by the posterior distribution.
\begin{align}
p({s^*}|{\bf x}^*,{\bf s}) & = \int p({s^*},{\bf w}~|~{\bf x}^*,{\bf s}) d{\bf w} \\
& = \int p({s^*}~|~{\bf w},{\bf x}^*,{\bf s}) p({\bf w}~|~{\bf x}^*,{\bf s}) d{\bf w} \\
& = \int p({s^*}~|~{\bf w},{\bf x}^*) p({\bf w}~|~{\bf s}) d{\bf w}
\end{align}
where we have used the fact that, given ${\bf w}$, the new target $s^*$ does not depend on the training targets ${\bf s}$, and that the posterior of ${\bf w}$ does not depend on the new input ${\bf x}^*$.
In this section, we consider a particular example in which we assume the following prior for the weights:
$${\bf w} \sim {\cal N}\left({\bf 0},{\pmb \Sigma}_{p} \right)$$The following figure shows functions generated by drawing weight vectors from this distribution:
In [4]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
degree = 12
nplots = 6
#Prior distribution parameters
sigma_eps = 0.1
mean_w = np.zeros((degree+1,))
sigma_w = 0.3
var_w = sigma_w * np.eye(degree+1)
X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)
for k in range(nplots):
    #Draw weights from the prior distribution
    w_iter = np.random.multivariate_normal(mean_w,var_w)
    S_grid_iter = np.polyval(w_iter,X_grid)
    ax.plot(X_grid,S_grid_iter,'g-')
ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
Out[4]:
According to the generative model, ${\bf s} = {\bf Z}{\bf w} + {\pmb \varepsilon}$
$${\bf s}~|~{\bf w} \sim {\cal N}\left({\bf Z}{\bf w},\sigma_\varepsilon^2 {\bf I} \right)$$Since both $p({\bf s}|{\bf w})$ and $p({\bf w})$ follow a Gaussian distribution, we know also that the joint distribution and the posterior distribution of ${\bf w}$ given ${\bf s}$ are also Gaussian. Therefore,
$${\bf w}~|~{\bf s} \sim {\cal N}\left({\bar{\bf w}} , {\pmb\Sigma}_{\bf w}\right)$$where the mean and the covariance matrix of the distribution are to be determined.
Exercise:
Show that the posterior mean and posterior covariance matrix of ${\bf w}$ given ${\bf s}$ are:
$${\pmb\Sigma}_{\bf w} = \left[\frac{1}{\sigma_\varepsilon^2} {\bf Z}^{\top}{\bf Z} + {\pmb \Sigma}_p^{-1}\right]^{-1}, \qquad {\bar{\bf w}} = {\sigma_\varepsilon^{-2}} {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}$$The following fragment of code draws random vectors from $p({\bf w}|{\bf s})$ and plots the corresponding regression curves along with the training points. Compare these curves with those drawn from the prior distribution of ${\bf w}$.
In [46]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
degree = 12
nplots = 6
#Prior distribution parameters
sigma_eps = 0.1
mean_w = np.zeros((degree+1,))
sigma_p = .3 * np.eye(degree+1)
X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)
#Compute matrix with training input data for the polynomial model
Z = []
for x_val in X_tr.tolist():
    Z.append([x_val[0]**k for k in range(degree+1)])
Z=np.asmatrix(Z)
#Compute posterior distribution parameters
Sigma_w = np.linalg.inv(np.dot(Z.T,Z)/(sigma_eps**2) + np.linalg.inv(sigma_p))
posterior_mean = Sigma_w.dot(Z.T).dot(S_tr)/(sigma_eps**2)
posterior_mean = np.array(posterior_mean).flatten()
for k in range(nplots):
    #Draw weights from the posterior distribution
    w_iter = np.random.multivariate_normal(posterior_mean,Sigma_w)
    #Note that polyval assumes the first element of weight vector is the coefficient of
    #the highest degree term. Thus, we need to reverse w_iter
    S_grid_iter = np.polyval(w_iter[::-1],X_grid)
    ax.plot(X_grid,S_grid_iter,'g-')
#We plot also the least square solution
w_LS = np.polyfit(X_tr.flatten(), S_tr.flatten(), degree)
S_grid_iter = np.polyval(w_LS,X_grid)
ax.plot(X_grid,S_grid_iter,'m-',label='LS regression')
ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
ax.legend(loc='best')
Out[46]:
Since $f^* = f({\bf x}^*) = {{\bf x}^*}^\top {\bf w}$ (where, with a slight abuse of notation, ${\bf x}^*$ denotes here the feature vector ${\bf z}^*$ associated with the new input), $f^*$ is also a Gaussian variable whose posterior mean and variance can be calculated as follows:
$$\mathbb{E}\{{{\bf x}^*}^\top {\bf w}~|~{\bf s}, {\bf x}^*\} = {{\bf x}^*}^\top \mathbb{E}\{{\bf w}|{\bf s}\} = {\sigma_\varepsilon^{-2}} {{\bf x}^*}^\top {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}$$
$$\text{Cov}\left[{{\bf x}^*}^\top {\bf w}~|~{\bf s}, {\bf x}^*\right] = {{\bf x}^*}^\top \text{Cov}\left[{\bf w}~|~{\bf s}\right] {{\bf x}^*} = {{\bf x}^*}^\top {\pmb \Sigma}_{\bf w} {{\bf x}^*}$$
Therefore, $f^*~|~{\bf s}, {\bf x}^* \sim {\cal N}\left({\sigma_\varepsilon^{-2}} {{\bf x}^*}^\top {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}, {{\bf x}^*}^\top {\pmb \Sigma}_{\bf w} {{\bf x}^*} \right)$
Finally, for $s^* = f^* + \varepsilon^*$, the posterior distribution is $s^*~|~{\bf s}, {\bf x}^* \sim {\cal N}\left({\sigma_\varepsilon^{-2}} {{\bf x}^*}^\top {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}, {{\bf x}^*}^\top {\pmb \Sigma}_{\bf w} {{\bf x}^*} + \sigma_\varepsilon^2\right)$
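Before plotting the full predictive curve, here is a minimal sketch of these two expressions at a single test point, reusing Sigma_w, posterior_mean, sigma_eps and degree computed in the previous code cell; note that the variance of $s^*$ adds $\sigma_\varepsilon^2$ to that of $f^*$, whereas the shaded intervals plotted in the cell below are built from the standard deviation of $f^*$ alone.
In [ ]:
# Sketch: posterior predictive mean and variance at one hypothetical test point.
# Reuses Sigma_w, posterior_mean, sigma_eps and degree from the previous code cell.
x_star = 1.0
z_star = np.array([x_star**k for k in range(degree+1)])
Sigma_w_arr = np.asarray(Sigma_w)                  # Sigma_w may be an np.matrix above
f_mean = z_star.dot(posterior_mean)                # E{f* | s} = E{s* | s}
f_var = z_star.dot(Sigma_w_arr).dot(z_star)        # Var{f* | s}
s_var = f_var + sigma_eps**2                       # Var{s* | s}: noise variance added
print('mean: %.3f, std of f*: %.3f, std of s*: %.3f'
      % (f_mean, np.sqrt(f_var), np.sqrt(s_var)))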
In [76]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
degree = 12
nplots = 6
#Prior distribution parameters
sigma_eps = 0.1
mean_w = np.zeros((degree+1,))
sigma_p = .5 * np.eye(degree+1)
X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)
#Compute matrix with training input data for the polynomial model
Z = []
for x_val in X_tr.tolist():
    Z.append([x_val[0]**k for k in range(degree+1)])
Z=np.asmatrix(Z)
#Compute posterior distribution parameters
Sigma_w = np.linalg.inv(np.dot(Z.T,Z)/(sigma_eps**2) + np.linalg.inv(sigma_p))
posterior_mean = Sigma_w.dot(Z.T).dot(S_tr)/(sigma_eps**2)
posterior_mean = np.array(posterior_mean).flatten()
#Plot the posterior mean
#Note that polyval assumes the first element of weight vector is the coefficient of
#the highest degree term. Thus, we need to reverse the order of the coefficients
S_grid_iter = np.polyval(posterior_mean[::-1],X_grid)
ax.plot(X_grid,S_grid_iter,'g-',label='Predictive mean, BI')
#Plot confidence intervals for the Bayesian Inference
std_x = []
for el in X_grid:
    x_ast = np.array([el**k for k in range(degree+1)])
    std_x.append(np.sqrt(x_ast.dot(Sigma_w).dot(x_ast)[0,0]))
std_x = np.array(std_x)
plt.fill_between(X_grid, S_grid_iter-std_x, S_grid_iter+std_x,
                 alpha=0.2, edgecolor='#1B2ACC', facecolor='#089FFF',
                 linewidth=4, linestyle='dashdot', antialiased=True)
#We plot also the least square solution
w_LS = np.polyfit(X_tr.flatten(), S_tr.flatten(), degree)
S_grid_iter = np.polyval(w_LS,X_grid)
ax.plot(X_grid,S_grid_iter,'m-',label='LS regression')
ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
ax.legend(loc='best')
Out[76]:
Not only do we obtain a better predictive model, but we also have confidence intervals (error bars) for the predictions.
In [ ]: