Changes:
* v.1.0 - First version, expanding some cells from the Bayesian Regression notebook
* v.1.1 - Python 3 version

Pending changes:
* Include regression on the stock data
In [ ]:
# Import some libraries that will be necessary for working with data and displaying plots
# To visualize plots in the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io # To read matlab files
import pylab
Given an observation vector ${\bf x}$, the goal of the regression problem is to find a function $f({\bf x})$ providing good predictions about some unknown variable $s$. To do so, we assume that a set of labelled training examples, $\{{\bf x}^{(k)}, s^{(k)}\}_{k=1}^K$ is available.
The predictor function should make good predictions for new observations ${\bf x}$ not used during training. In practice, this is tested using a second set (the test set) of labelled samples.
Most regression algorithms are grounded in the idea that all samples from the training and test sets have been generated independently by some common stochastic process. This is the reason why a model fitted to the training data can be expected to make good predictions on new test samples.
Mathematically, this means that all pairs $({\bf x}^{(k)}, s^{(k)})$ from the training and test sets are independent and identically distributed (i.i.d.) samples from some distribution $p_{{\bf X}, S}({\bf x}, s)$. Unfortunately, this distribution is generally unknown.
<img src="figs/DataModel.png", width=180>
NOTE: In the following, we will use capital letters, ${\bf X}$, $S$, ..., to denote random variables, and lower-case letters, ${\bf x}$, $s$, ..., to denote the values they can take. When there is no ambiguity, we will drop the subindices of the density functions, writing $p_{{\bf X}, S}({\bf x}, s)= p({\bf x}, s)$, to simplify the mathematical notation.
If $p({\bf x}, s)$ were known, we could apply estimation theory to estimate $s$ from ${\bf x}$. For instance, we could apply any of the classical estimates, such as the minimum MSE estimate (the mean of the posterior $p(s|{\bf x})$), the minimum MAD estimate (its median), or the MAP estimate (its mode).
Note that, since these estimators depend on $p(s |{\bf x})$, knowing the posterior distribution of the target variable is enough, and we do not need to know the joint distribution.
Model-based regression methods exploit this idea: use the training data to estimate the posterior distribution $p(s|{\bf x})$, and then apply estimation theory to make predictions.
<img src="figs/ModelBasedReg.png", width=280>
How can we estimate the posterior probability function of the target variable $s$? In this section we will explore a parametric estimation method: let us assume that $p$ belongs to a parametric family of distributions $p(s|{\bf x},{\bf w})$, where ${\bf w}$ is some unknown parameter vector. We will use the training data to estimate ${\bf w}$.
<img src="figs/ParametricReg.png", width=300>
The estimation of ${\bf w}$ from a given dataset $\mathcal{D}$ is the goal of the following sections.
The ML (Maximum Likelihood) principle is well-known in statistics and can be stated as follows: take the value of the parameter to be estimated (in our case, ${\bf w}$) that best explains the given observations (in our case, the training dataset $\mathcal{D}$). Mathematically, this can be expressed as follows: $$ \hat{\bf w}_{\text{ML}} = \arg \max_{\bf w} p(\mathcal{D}|{\bf w}) $$
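As a quick, self-contained illustration of the ML principle (not yet the regression problem), the sketch below estimates the mean of a Gaussian distribution by maximizing the log-likelihood over a grid of candidate values and checks that the result matches the sample mean. The data, grid and variable names are arbitrary choices made just for this example.
In [ ]:
# Toy illustration of the ML principle: estimate the mean of a Gaussian
# distribution with known (unit) variance by maximizing the likelihood.
np.random.seed(0)
data = np.random.normal(loc=1.5, scale=1.0, size=50)   # observed dataset (illustrative)
mu_grid = np.arange(-1, 4, 0.01)                       # candidate parameter values

# Log-likelihood of each candidate mean (additive constants are ignored)
logL = np.array([-0.5 * np.sum((data - mu)**2) for mu in mu_grid])
mu_ML = mu_grid[np.argmax(logL)]

# The ML estimate coincides with the sample mean, up to the grid resolution
print(mu_ML, np.mean(data))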
To be more specific: let us group the target variables into a vector $$ {\bf s} = \left(s^{(1)}, \dots, s^{(K)}\right)^\top $$ and the input vectors into a matrix $$ {\bf X} = \left({\bf x}^{(1)}, \dots, {\bf x}^{(K)}\right)^\top $$
To compute the likelihood function we will assume that the inputs do not depend on ${\bf w}$ (i.e., $p({\bf x}|{\bf w}) = p({\bf x})$). This means that ${\bf w}$ is a parameter of the posterior distribution of $s$, and not a parameter of the marginal distribution of ${\bf x}$.
Then we can write
$$p(\mathcal{D}|{\bf w}) = p({\bf s}, {\bf X}|{\bf w}) = p({\bf s} | {\bf X}, {\bf w}) p({\bf X}|{\bf w}) = p({\bf s} | {\bf X}, {\bf w}) p({\bf X}) $$ and we can express the estimation problem as the computation of $$ \hat{\bf w}_{\text{ML}} = \arg \max_{\bf w} p({\bf s}|{\bf X},{\bf w}) $$
Let's summarize what we need to do in order to design a regression algorithm: assume a parametric model $p(s|{\bf x},{\bf w})$ for the target variable, use the training data to estimate ${\bf w}$ (here, by maximum likelihood), and then use the fitted model $p(s|{\bf x},\hat{\bf w}_\text{ML})$ to make predictions for new inputs.
Let us assume that the target variables $s^{(k)}$ in dataset $\mathcal{D}$ are given by $$ s^{(k)} = {\bf w}^\top {\bf z}^{(k)} + \varepsilon^{(k)} $$
where ${\bf z}^{(k)}$ is the result of some transformation of the inputs, ${\bf z}^{(k)} = T({\bf x}^{(k)})$, and $\varepsilon^{(k)}$ are i.i.d. instances of a Gaussian random variable with mean zero and variance $\sigma_\varepsilon^2$, i.e., $$ p_E(\varepsilon) = \frac{1}{\sqrt{2\pi}\sigma_\varepsilon} \exp\left(-\frac{\varepsilon^2}{2\sigma_\varepsilon^2}\right) $$
Assuming that the noise variables are independent of ${\bf x}$ and ${\bf w}$, then, for given ${\bf x}$ and ${\bf w}$, the target variable is Gaussian with mean ${\bf w}^\top {\bf z}$ and variance $\sigma_\varepsilon^2$: $$ p(s|{\bf x}, {\bf w}) = p_E(s-{\bf w}^\top{\bf z}) = \frac{1}{\sqrt{2\pi}\sigma_\varepsilon} \exp\left(-\frac{(s-{\bf w}^\top{\bf z})^2}{2\sigma_\varepsilon^2}\right) $$
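Before deriving the ML estimate, it may help to see this data model in action. The following sketch draws a synthetic dataset from the model, using an arbitrary "true" weight vector, noise level, sample size and transformation $T(x) = (1, x)^\top$, all chosen here just for illustration; the `_syn` suffixes are used so these names do not clash with the exercise variables further below.
In [ ]:
# Generate synthetic data from the linear Gaussian model s = w'z + eps
# (the "true" weights, noise level and sample size are illustrative choices)
np.random.seed(0)
K_syn = 100
w_true = np.array([0.5, 2.0])                    # "true" weights: intercept and slope
sigma_syn = 0.4                                  # noise standard deviation

x_syn = np.random.rand(K_syn)                    # scalar inputs
Z_syn = np.column_stack((np.ones(K_syn), x_syn)) # z = T(x) = (1, x)
s_syn = Z_syn.dot(w_true) + np.random.normal(0, sigma_syn, K_syn)

plt.figure()
plt.scatter(x_syn, s_syn)
plt.xlabel('x')
plt.ylabel('s')
plt.show()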
Now we need to compute the likelihood function $p({\bf s}, {\bf X} | {\bf w})$. If the samples are i.i.d. we can write $$ p({\bf s}| {\bf X}, {\bf w}) = \prod_{k=1}^{K} p(s^{(k)}| {\bf x}^{(k)}, {\bf w}) = \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\sigma_\varepsilon} \exp\left(-\frac{\left(s^{(k)}-{\bf w}^\top{\bf z}^{(k)}\right)^2}{2\sigma_\varepsilon^2}\right) \\ = \left(\frac{1}{\sqrt{2\pi}\sigma_\varepsilon}\right)^K \exp\left(-\sum_{k=1}^K \frac{\left(s^{(k)}-{\bf w}^\top{\bf z}^{(k)}\right)^2}{2\sigma_\varepsilon^2}\right) \\ $$ Finally, grouping variables ${\bf z}^{(k)}$ in $${\bf Z} = \left({\bf z}^{(1)}, \dots, {\bf z}^{(K)}\right)^\top$$ we get $$ p({\bf s}| {\bf X}, {\bf w}) = \left(\frac{1}{\sqrt{2\pi}\sigma_\varepsilon}\right)^K \exp\left(-\frac{1}{2\sigma_\varepsilon^2}\|{\bf s}-{\bf Z}{\bf w}\|^2\right) $$
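As a sanity check of the expression above, the sketch below evaluates the logarithm of $p({\bf s}|{\bf X},{\bf w})$ for a candidate weight vector, using the vectorized form just derived. It reuses the synthetic variables `Z_syn`, `s_syn`, `w_true` and `sigma_syn` from the previous illustrative cell, so it assumes that cell has been run.
In [ ]:
# Log-likelihood of the data for a candidate weight vector w, using the
# vectorized Gaussian expression derived above.
def log_likelihood(w, Z, s, sigma):
    err = s - Z.dot(w)
    K = len(s)
    return -K * np.log(np.sqrt(2 * np.pi) * sigma) - np.sum(err**2) / (2 * sigma**2)

# The log-likelihood should be higher at the true weights than at a perturbed value
print(log_likelihood(w_true, Z_syn, s_syn, sigma_syn))
print(log_likelihood(w_true + 1.0, Z_syn, s_syn, sigma_syn))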
Maximizing this likelihood with respect to ${\bf w}$ is equivalent to minimizing $\|{\bf s}-{\bf Z}{\bf w}\|^2$, since the exponential is monotonic and the remaining factor does not depend on ${\bf w}$. Note that this is exactly the same optimization problem as in the Least Squares (LS) regression algorithm. The solution is $$ {\bf w}_\text{ML} = ({\bf Z}^\top{\bf Z})^{-1}{\bf Z}^\top{\bf s} $$
The last step consists in computing an estimate of $s$ by assuming that the true value of the weight vector is ${\bf w}_\text{ML}$. In particular, the minimum MSE estimate is $$ \hat{s}_\text{MSE} = \mathbb{E}\{s|{\bf x},{\bf w}_\text{ML}\} $$ Knowing that, given ${\bf x}$ and ${\bf w}$, $s$ is normally distributed with mean ${\bf w}^\top {\bf z}$, we can write $$ \hat{s}_\text{MSE} = {\bf w}_\text{ML}^\top {\bf z} $$
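The whole procedure (fit ${\bf w}$ by ML, then predict with the minimum MSE estimate) can be sketched on the same synthetic data used above; again, this assumes the earlier illustrative cell has been run. The normal-equation solution is compared against `np.linalg.lstsq`, which solves the same least squares problem.
In [ ]:
# ML (= least squares) fit on the synthetic data, followed by MSE predictions
wML_syn = np.linalg.solve(Z_syn.T.dot(Z_syn), Z_syn.T.dot(s_syn))  # normal equations
wLS_syn = np.linalg.lstsq(Z_syn, s_syn, rcond=None)[0]             # same solution via lstsq
print(wML_syn, wLS_syn)

# Minimum MSE predictions for a grid of new inputs: s_hat = wML' z
x_grid = np.arange(0, 1, 0.01)
Z_grid = np.column_stack((np.ones(len(x_grid)), x_grid))
s_hat = Z_grid.dot(wML_syn)

plt.figure()
plt.scatter(x_syn, s_syn)
plt.plot(x_grid, s_hat, 'r')
plt.xlabel('x')
plt.ylabel('s')
plt.show()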
Assume that the inputs $x^{(k)}$ and targets $s^{(k)}$ in the following dataset
In [ ]:
X = np.array([0.15, 0.41, 0.53, 0.80, 0.89, 0.92, 0.95])
s = np.array([0.09, 0.16, 0.63, 0.44, 0.55, 0.82, 0.95])
have been generated by a linear Gaussian model (i.e., with $z = T(x) = x$) with noise variance
In [ ]:
sigma_eps = 0.3
In [ ]:
# <SOL>
# </SOL>
In [ ]:
# Note that, to use lstsq, the input matrix must be K x 1
Xcol = X[:,np.newaxis]
# Compute the ML estimate using np.linalg.lstsq from NumPy.
# wML = <FILL IN>
In [ ]:
K = len(s)
wGrid = np.arange(-0.5, 2, 0.01)
p = []
for w in wGrid:
    d = s - X*w
    # p.append(<FILL IN>)
# Compute the likelihood for the ML parameter wML
# d = <FILL IN>
# pML = [<FILL IN>]
# Plot the likelihood function and the optimal value
plt.figure()
plt.plot(wGrid, p)
plt.stem(wML, pML)
plt.xlabel('$w$')
plt.ylabel('Likelihood function')
plt.show()
In [ ]:
x = np.arange(0, 1.2, 0.01)
# sML = <FILL IN>
plt.figure()
plt.scatter(X, s)
# plt.plot(<FILL IN>)
plt.xlabel('x')
plt.ylabel('s')
plt.axis('tight')
plt.show()
Solution:
In [ ]:
K = len(s)
wGrid = np.arange(0, 6, 0.01)
p = []
Px = np.prod(X)
xs = np.dot(X,s)
for w in wGrid:
    # p.append(<FILL IN>)
plt.figure()
# plt.plot(<FILL IN>)
plt.xlabel('$w$')
plt.ylabel('Likelihood function')
plt.show()
Solution:
In [ ]:
# wML = <FILL IN>
print(wML)
With an additive, independent Gaussian noise model, the maximum likelihood and the least squares solutions are the same. We have not improved much ...
However, we have now formulated the problem in a probabilistic way. This opens the door to reasoning in terms of a set of possible explanations of the data, not just one: more than one value of the model parameters could plausibly have generated the observations.
In [ ]: