Bayesian regression

Notebook version: 1.0 (Oct 01, 2015)

Author: Jerónimo Arenas García (jarenas@tsc.uc3m.es)

Changes: v.1.0 - First version

Pending changes: * Include regression on the stock data

In [1]:
# Import some libraries that will be necessary for working with data and displaying plots

# To visualize plots in the notebook
%matplotlib inline 

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io       # To read matlab files
import pylab

3. Bayesian regression

In the previous session we tackled the problem of fitting the following model using a LS criterion:

$${\hat s}({\bf x}) = f({\bf x}) = {\bf w}^\top {\bf z}$$

where ${\bf z}$ is a vector whose components can be computed directly from the observed variables. Such a model includes the linear regression problem, for which ${\bf z} = [1; {\bf x}]$, as well as any non-linear model that can be expressed as a "linear in the parameters" model.
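For illustration, the following minimal sketch (not part of the original lab code; the function name and values are illustrative only) shows how ${\bf z}$ could be built for a polynomial model of a scalar observation $x$:

In [ ]:
#Minimal sketch: feature vector z = [1, x, x^2, ..., x^degree] for a polynomial model
#(function name and values are illustrative only)
def build_z(x, degree=3):
    return np.array([x**k for k in range(degree + 1)])

z = build_z(0.5)          # array([1., 0.5, 0.25, 0.125])
w = np.ones(4)            # an arbitrary weight vector
s_hat = w.dot(z)          # prediction w^T z = 1.875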

The LS solution was defined as the one minimizing the square of the residuals over the training set $\{{\bf x}^{(k)}, s^{(k)}\}_{k=1}^K$. As a result, a single parameter vector ${\bf w}^*$ was obtained, and correspondingly a single regression curve.

In this session, rather than trying to obtain the best single model, we will work with a family of models or functions, and model the problem probabilistically, so that we can assign a probability value to each of the possible functions.

3.1 Maximum Likelihood estimation of the weights

3.1.1 Limitations of the LS approach. The need for assumptions

Consider the same regression task of the previous session. We have a training dataset consisting of 15 given points; the code below depicts the regression curves that would be obtained when an additional point is added at a fixed location, for several possible target values of that point:

(You can run this code fragment several times to check how the regression curves change between executions, depending also on the location of the training points.)


In [42]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
n_val_16 = 5
degree = 12

X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model

X_16 = .3 * np.ones((n_val_16,))
S_16 = np.linspace(np.min(S_tr),np.max(S_tr),n_val_16)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)
ax.plot(X_16,S_16,'ro',markersize=6)
ax.plot(X_grid,S_grid,'r-',label='True model')

for el in zip(X_16,S_16):
    #Add point to the training set
    X_tr_iter = np.append(X_tr,el[0])
    S_tr_iter = np.append(S_tr,el[1])
    
    #Obtain LS regression coefficients and evaluate it at X_grid
    w_LS = np.polyfit(X_tr_iter, S_tr_iter, degree)
    S_grid_iter = np.polyval(w_LS,X_grid)
    ax.plot(X_grid,S_grid_iter,'g-')

ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
ax.legend(loc='best')


Out[42]:
<matplotlib.legend.Legend at 0x10c8cbad0>
  • You can control the degree of the polynomial, and check that when the degree is set to 15 (16 weights) all points will be fitted perfectly
  • It seems obvious that we have not solved the problem ...
    • The regression curves overfit the training data
    • The regression curves change a lot when varying the label of just one pattern

The key missing ingredient is assumptions!

Open questions

  • Do we think that all models are equally probable... before we see any data? What does the term model probability mean?

  • Do we need to choose a single "best" model or can we consider several simultaneously?

  • Perhaps our training targets are contaminated with noise. What to do?

We will start postulating a generative model for the training data that includes the presence of noise contaminating the targets, and work on this model to partly answer the other two questions.

3.1.2 Generative model

Denoting by $f({\bf x}) = {{\bf w}}^\top {\bf z}$ the true function that we would like to obtain, we could assume that the observations in the training set are obtained as noisy values of the output of such function, i.e.,

$$s^{(k)} = f({\bf x}^{(k)}) + \varepsilon^{(k)}$$

We will further characterize the noise values as i.i.d. and normally distributed, with mean zero, and variance $\sigma_\varepsilon^2$, i.e.,

$$\varepsilon \sim {\cal N}\left(0, \sigma_\varepsilon^2\right)$$
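A minimal sketch of this generative model for the parametric case $f({\bf x}) = {\bf w}^\top {\bf z}$ (the true weights, the polynomial degree and the noise level below are arbitrary illustrative values, not reused elsewhere in the notebook):

In [ ]:
#Sketch of the generative model: s = Z w_true + noise (illustrative values only)
K = 15
w_true = np.array([0.5, -1.0, 0.3])       # hypothetical true weights
sigma_eps_true = 0.2                      # noise standard deviation

x_gen = 3 * np.random.random((K,)) - 0.5
Z_gen = np.column_stack([x_gen**k for k in range(len(w_true))])   # rows are z^(k) = [1, x, x^2]
s_gen = Z_gen.dot(w_true) + sigma_eps_true * np.random.randn(K)   # s^(k) = f(x^(k)) + eps^(k)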

3.1.3 The maximum likelihood solution

  • Joint distribution of the noise samples, ${\pmb \varepsilon} = \left[\varepsilon^{(1)}, \dots, \varepsilon^{(K)}\right]^\top$:

    $${\pmb \varepsilon} \sim {\cal N}\left( {\bf 0}, \sigma_{\varepsilon}^2 {\bf I}\right), \;\;\; p({\pmb \varepsilon}) = \left(\frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}}\right)^K \exp\left(- \frac{{\pmb \varepsilon}^\top {\pmb \varepsilon}}{2 \sigma_{\varepsilon}^2}\right)$$

  • Denoting ${\bf s} = \left[s^{(1)}, \dots, s^{(K)} \right]^\top$ and ${\bf f} = \left[ f({\bf x}^{(1)}), \dots,f({\bf x}^{(K)})\right]^\top$, we have

    $${\bf s} = {\bf f} + {\pmb \varepsilon}$$

  • Conditioning on the values of the target function, ${\bf f}$, the pdf of the available targets is obtained as a shifted version of the distribution of the noise. More precisely:

    \begin{align}p({\bf s}|{\bf f}) & = \left(\frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}}\right)^K \exp\left(- \frac{\|{\bf s} - {\bf f}\|^2}{2 \sigma_{\varepsilon}^2}\right) \end{align}

  • For the particular parametric selection of $f({\bf x})$, ${\bf f} = {\bf Z} {\bf w}$, conditioning on ${\bf f}$ is equivalent to conditioning on ${\bf w}$, so that:

    $$p({\bf s}|{\bf f}) = p({\bf s}|{\bf w}) = \left(\frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}}\right)^K \exp\left(- \frac{\|{\bf s} - {\bf Z}{\bf w}\|^2}{2 \sigma_{\varepsilon}^2}\right)$$

  • The previous expression represents the probability of the observed targets given the weights, and is also known as the likelihood of the weights for a particular training set.

The maximum likelihood solution is then given by:

$${\bf w}_{ML} = \arg \max_{\bf w} p({\bf s}|{\bf w}) = \arg \min_{\bf w} \|{\bf s} - {\bf Z}{\bf w}\|^2$$
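Since the maximization of the likelihood reduces to a least squares problem, ${\bf w}_{ML}$ can be computed with any standard LS solver. A minimal sketch on toy data (the variable names below are illustrative and not reused elsewhere in the notebook):

In [ ]:
#ML estimation of the weights: with i.i.d. Gaussian noise it coincides with the LS solution
x_ml = 3 * np.random.random((15,)) - 0.5
s_ml = - np.cos(3 * x_ml) + 0.2 * np.random.randn(15)
deg_ml = 3

Z_ml = np.column_stack([x_ml**k for k in range(deg_ml + 1)])
w_ML = np.linalg.lstsq(Z_ml, s_ml)[0]               # argmin_w ||s - Z w||^2

#np.polyfit solves the same LS problem (coefficients in decreasing degree order)
w_LS_check = np.polyfit(x_ml, s_ml, deg_ml)[::-1]
print(np.max(np.abs(w_ML - w_LS_check)))            # should be (numerically) zero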

3.1.4 Multiple explanations of the data

With an additive Gaussian independent noise model, the maximum likelihood and the least squares solutions are the same. We have not improved much ...

However, we have already formulated the problem in a probabilistic way. This opens the door to reasoning in terms of a set of possible explanations, not just one. We believe more than one of our models could have generated the data.

  • We do not believe all models are equally likely to have generated the data
  • We may believe that a simpler model is more likely than a complex one

3.2 Bayesian Inference

3.2.1 Posterior distribution of weights

  • If we express our a priori belief of models using a prior distribution $p({\bf f})$, then we can infer the a posteriori distribution using Bayes' rule:

    $$p({\bf f}|{\bf s}) = \frac{p({\bf s}|{\bf f})~p({\bf f})}{p({\bf s})}$$

    In the previous expression:

    • $p({\bf s}|{\bf f})$: is the likelihood function
    • $p({\bf f})$: is the prior distribution of the models (assumptions are needed here)
    • $p({\bf s})$: is the marginal distribution of the observed data, which could be obtained integrating the numerator over all possible models. However, we normally do not need to explicitly compute $p({\bf s})$
  • For the parametric model ${\bf f} = {\bf Z} {\bf w}$, the previous expressions become:

    $$p({\bf w}|{\bf s}) = \frac{p({\bf s}|{\bf w})~p({\bf w})}{p({\bf s})}$$

    Where:

    • $p({\bf s}|{\bf w})$: is the likelihood function
    • $p({\bf w})$: is the prior distribution of the weights (assumptions are needed here)
    • $p({\bf s})$: is the marginal distribution of the observed data, which could be obtained integrating the numerator over ${\bf w}$. As before, we normally do not need to compute it explicitly (a toy numerical illustration with a single weight is sketched after this list)
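The following toy fragment illustrates Bayes' rule numerically for a model with a single weight, $s = w x + \varepsilon$, evaluating prior, likelihood and posterior on a grid of values of $w$ (all names and values are illustrative only):

In [ ]:
#Toy illustration of Bayes' rule with a single weight: posterior is proportional to likelihood x prior
sigma_eps_toy = 0.3
sigma_p_toy = 1.0
x_toy = np.array([0.5, 1.0, 1.5, 2.0])
s_toy = 0.8 * x_toy + sigma_eps_toy * np.random.randn(4)

w_grid = np.linspace(-2, 3, 500)
likelihood = np.array([np.exp(-np.sum((s_toy - w * x_toy)**2) / (2 * sigma_eps_toy**2))
                       for w in w_grid])
prior = np.exp(-w_grid**2 / (2 * sigma_p_toy**2))
posterior = likelihood * prior
#p(s) only normalizes the result; on a grid it can be approximated by a sum
posterior /= (np.sum(posterior) * (w_grid[1] - w_grid[0]))

plt.figure()
plt.plot(w_grid, posterior)
plt.xlabel('w')
plt.ylabel('p(w|s), grid approximation')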

3.2.2 Maximum likelihood vs Bayesian Inference. Making predictions

  • Following a ML approach, we retain a single model, ${\bf w}_{ML} = \arg \max_{\bf w} p({\bf s}|{\bf w})$. Then, the predictive distribution of the target value for a new point would be obtained as:

    $$p({s^*}|{\bf w}_{ML},{\bf x}^*) $$

    For the generative model of Section 3.1.2 (additive i.i.d. Gaussian noise), this distribution is:

    $$p({s^*}|{\bf w}_{ML},{\bf x}^*) = \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp \left(-\frac{\left(s^* - {\bf w}_{ML}^\top {\bf z}^*\right)^2}{2 \sigma_\varepsilon^2} \right)$$

    • The mean of $s^*$ is the same as the prediction of the LS model, and the same uncertainty (the noise variance of the model) is assumed regardless of the observation vector.

    • If a single value is to be kept, we would probably keep the mean of the distribution, which is equivalent to the LS prediction.

  • Using Bayesian inference, we retain all models. Then, the inference of the value $s^* = s({\bf x}^*)$ is carried out by mixing all models, according to the weights given by the posterior distribution (a Monte Carlo sketch of this mixture is given after this list).

    \begin{align}p({s^*}|{\bf x}^*,{\bf s}) & = \int p({s^*},{\bf w}~|~{\bf x}^*,{\bf s}) d{\bf w} \\
                    & = \int p({s^*}~|~{\bf w},{\bf x}^*,{\bf s}) p({\bf w}~|~{\bf x}^*,{\bf s}) d{\bf w} \\
                    & = \int p({s^*}~|~{\bf w},{\bf x}^*) p({\bf w}~|~{\bf s}) d{\bf w}\end{align}

    where:

    • $p({s^*}|{\bf w},{\bf x}^*) = \displaystyle\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp \left(-\frac{\left(s^* - {\bf w}^\top {\bf z}^*\right)^2}{2 \sigma_\varepsilon^2} \right)$
    • $p({\bf w}~|~{\bf s})$: is the posterior distribution of the weights, which can be computed using Bayes' Theorem.
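As a minimal sketch of this mixture, the integral above can be approximated by Monte Carlo, drawing samples of ${\bf w}$ from the posterior and averaging the corresponding Gaussian densities. The posterior mean and covariance below are placeholders for a hypothetical 2-weight model, since the actual posterior is derived in the next section:

In [ ]:
#Monte Carlo sketch of the predictive mixture: average p(s*|w_m, x*) over posterior samples w_m
#(the posterior mean/covariance below are placeholders for a hypothetical 2-weight model)
post_mean_mc = np.array([0.2, 0.8])
post_cov_mc = np.array([[0.05, 0.01], [0.01, 0.05]])
sigma_eps_mc = 0.2
z_ast = np.array([1.0, 1.5])          # z* for a new input x* = 1.5, with z = [1; x]

M = 2000
w_samples = np.random.multivariate_normal(post_mean_mc, post_cov_mc, M)
s_ast_grid = np.linspace(-1, 3, 300)
p_s_ast = np.zeros_like(s_ast_grid)
for w_m in w_samples:
    p_s_ast += np.exp(-(s_ast_grid - w_m.dot(z_ast))**2 / (2 * sigma_eps_mc**2))
p_s_ast /= (M * np.sqrt(2 * np.pi) * sigma_eps_mc)

plt.figure()
plt.plot(s_ast_grid, p_s_ast)
plt.xlabel('s*')
plt.ylabel('p(s*|x*,s), MC approximation')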

3.2.3 Example: Selecting a Gaussian prior for the weights

Prior distribution of the weights

In this section, we consider a particular example in which we assume the following prior for the weights:

$${\bf w} \sim {\cal N}\left({\bf 0},{\pmb \Sigma}_{p} \right)$$

The following figure shows functions generated by drawing weight vectors from this distribution.


In [4]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
degree = 12
nplots = 6

#Prior distribution parameters
sigma_eps = 0.1
mean_w = np.zeros((degree+1,))
sigma_w = 0.3
var_w = sigma_w * np.eye(degree+1)   #Prior covariance matrix (Sigma_p)

X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)

for k in range(nplots):
    
    #Draw weights from the prior distribution
    w_iter = np.random.multivariate_normal(mean_w,var_w)
    S_grid_iter = np.polyval(w_iter,X_grid)
    ax.plot(X_grid,S_grid_iter,'g-')

ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)   #Reuse the vertical range of the previous figure (S_16 was defined in In [42])


Out[4]:
(-2.9673122915019876, 2.8954061213924756)

Likelihood of the weights

According to the generative model, ${\bf s} = {\bf Z}{\bf w} + {\pmb \varepsilon}$

$${\bf s}~|~{\bf w} \sim {\cal N}\left({\bf Z}{\bf w},\sigma_\varepsilon^2 {\bf I} \right)$$

Posterior distribution of the weights

$$p({\bf w}|{\bf s}) = \frac{p({\bf s}|{\bf w})~p({\bf w})}{p({\bf s})}$$

Since both $p({\bf s}|{\bf w})$ and $p({\bf w})$ are Gaussian, the joint distribution of ${\bf s}$ and ${\bf w}$, and hence the posterior distribution of ${\bf w}$ given ${\bf s}$, are also Gaussian. Therefore,

$${\bf w}~|~{\bf s} \sim {\cal N}\left({\bar{\bf w}} , {\pmb\Sigma}_{\bf w}\right)$$

where the mean and the covariance matrix of the distribution are to be determined.

Exercise:

Show that the posterior mean and posterior covariance matrix of ${\bf w}$ given ${\bf s}$ are:

$${\pmb\Sigma}_{\bf w} = \left[\frac{1}{\sigma_\varepsilon^2} {\bf Z}^{\top}{\bf Z} + {\pmb \Sigma}_p^{-1}\right]^{-1}$$

$${\bar{\bf w}} = {\sigma_\varepsilon^{-2}} {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}$$

The following fragment of code draws random vectors from $p({\bf w}|{\bf s})$, and plots the corresponding regression curves along with the training points. Compare these curves with those drawn from the prior distribution of ${\bf w}$.


In [46]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
degree = 12
nplots = 6

#Prior distribution parameters
sigma_eps = 0.1
mean_w = np.zeros((degree+1,))
sigma_p = .3 * np.eye(degree+1)

X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)

#Compute matrix with training input data for the polynomial model
Z = []
for x_val in X_tr.tolist():
    Z.append([x_val[0]**k for k in range(degree+1)])
Z=np.asmatrix(Z)

#Compute posterior distribution parameters
Sigma_w = np.linalg.inv(np.dot(Z.T,Z)/(sigma_eps**2) + np.linalg.inv(sigma_p))
posterior_mean = Sigma_w.dot(Z.T).dot(S_tr)/(sigma_eps**2)
posterior_mean = np.array(posterior_mean).flatten()

for k in range(nplots):
    
    #Draw weights from the posterior distribution
    w_iter = np.random.multivariate_normal(posterior_mean,Sigma_w)
    #Note that polyval assumes the first element of weight vector is the coefficient of
    #the highest degree term. Thus, we need to reverse w_iter
    S_grid_iter = np.polyval(w_iter[::-1],X_grid)
    ax.plot(X_grid,S_grid_iter,'g-')

#We plot also the least square solution
w_LS = np.polyfit(X_tr.flatten(), S_tr.flatten(), degree)
S_grid_iter = np.polyval(w_LS,X_grid)
ax.plot(X_grid,S_grid_iter,'m-',label='LS regression')
    
ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
ax.legend(loc='best')


Out[46]:
<matplotlib.legend.Legend at 0x10d098e90>

Posterior distribution of the target

  • Since $f^* = f({\bf x}^*) = {{\bf x}^*}^\top {\bf w}$, $f^*$ is also a Gaussian variable whose posterior mean and variance can be calculated as follows:

    $$\mathbb{E}\{{{\bf x}^*}^\top {\bf w}~|~{\bf s}, {\bf x}^*\} = {{\bf x}^*}^\top \mathbb{E}\{{\bf w}|{\bf s}\} = {\sigma_\varepsilon^{-2}} {{\bf x}^*}^\top {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}$$

    $$\text{Cov}\left[{{\bf x}^*}^\top {\bf w}~|~{\bf s}, {\bf x}^*\right] = {{\bf x}^*}^\top \text{Cov}\left[{\bf w}~|~{\bf s}\right] {{\bf x}^*} = {{\bf x}^*}^\top {\pmb \Sigma}_{\bf w} {{\bf x}^*}$$

  • Therefore, $f^*~|~{\bf s}, {\bf x}^* \sim {\cal N}\left({\sigma_\varepsilon^{-2}} {{\bf x}^*}^\top {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}, {{\bf x}^*}^\top {\pmb \Sigma}_{\bf w} {{\bf x}^*} \right)$

  • Finally, for $s^* = f^* + \varepsilon^*$, the posterior distribution is $s^*~|~{\bf s}, {\bf x}^* \sim {\cal N}\left({\sigma_\varepsilon^{-2}} {{\bf x}^*}^\top {\pmb\Sigma}_{\bf w} {\bf Z}^\top {\bf s}, {{\bf x}^*}^\top {\pmb \Sigma}_{\bf w} {{\bf x}^*} + \sigma_\varepsilon^2\right)$


In [76]:
n_points = 15
n_grid = 200
frec = 3
std_n = 0.2
degree = 12
nplots = 6

#Prior distribution parameters
sigma_eps = 0.1
mean_w = np.zeros((degree+1,))
sigma_p = .5 * np.eye(degree+1)

X_tr = 3 * np.random.random((n_points,1)) - 0.5
S_tr = - np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
X_grid = np.linspace(-.5,2.5,n_grid)
S_grid = - np.cos(frec*X_grid) #Noise free for the true model

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(X_tr,S_tr,'b.',markersize=10)

#Compute matrix with training input data for the polynomial model
Z = []
for x_val in X_tr.tolist():
    Z.append([x_val[0]**k for k in range(degree+1)])
Z=np.asmatrix(Z)

#Compute posterior distribution parameters
Sigma_w = np.linalg.inv(np.dot(Z.T,Z)/(sigma_eps**2) + np.linalg.inv(sigma_p))
posterior_mean = Sigma_w.dot(Z.T).dot(S_tr)/(sigma_eps**2)
posterior_mean = np.array(posterior_mean).flatten()

#Plot the posterior mean
#Note that polyval assumes the first element of the weight vector is the coefficient of
#the highest degree term. Thus, we need to reverse posterior_mean
S_grid_iter = np.polyval(posterior_mean[::-1],X_grid)
ax.plot(X_grid,S_grid_iter,'g-',label='Predictive mean, BI')

#Plot confidence intervals for the Bayesian Inference
std_x = []
for el in X_grid:
    x_ast = np.array([el**k for k in range(degree+1)])
    std_x.append(np.sqrt(x_ast.dot(Sigma_w).dot(x_ast)[0,0]))
std_x = np.array(std_x)
plt.fill_between(X_grid, S_grid_iter-std_x, S_grid_iter+std_x,
    alpha=0.2, edgecolor='#1B2ACC', facecolor='#089FFF',
    linewidth=4, linestyle='dashdot', antialiased=True)

#We plot also the least square solution
w_LS = np.polyfit(X_tr.flatten(), S_tr.flatten(), degree)
S_grid_iter = np.polyval(w_LS,X_grid)
ax.plot(X_grid,S_grid_iter,'m-',label='LS regression')
    
ax.set_xlim(-.5,2.5)
ax.set_ylim(S_16[0]-2,S_16[-1]+2)
ax.legend(loc='best')


Out[76]:
<matplotlib.legend.Legend at 0x10f80ffd0>

Not only do we obtain a better predictive model, but we also have confidence intervals (error bars) for the predictions.
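The shaded region above shows the posterior uncertainty of $f^*$ only. According to the last expression above, the error bars for $s^*$ should also include the noise variance $\sigma_\varepsilon^2$. A minimal sketch, reusing posterior_mean, sigma_eps, X_grid and std_x computed in the previous cell:

In [ ]:
#Error bars for s* = f* + eps*: add the noise variance to the posterior variance of f*
#(reuses posterior_mean, sigma_eps, X_grid and std_x computed in the previous cell)
S_post_mean = np.polyval(posterior_mean[::-1], X_grid)
std_s = np.sqrt(std_x**2 + sigma_eps**2)

plt.figure()
plt.plot(X_grid, S_post_mean, 'g-', label='Predictive mean, BI')
plt.fill_between(X_grid, S_post_mean - std_s, S_post_mean + std_s,
                 alpha=0.2, facecolor='#089FFF')
plt.plot(X_tr, S_tr, 'b.', markersize=10)
plt.xlim(-.5, 2.5)
plt.legend(loc='best')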


In [ ]: