Changes: v.1.0 - First version, expanding some cells from the Bayesian Regression
notebook
v.1.1 - Python 3 version.
v.1.2 - Revised presentation.
v.1.3 - Updated index notation
Pending changes: * Include regression on the stock data
In [3]:
# Import some libraries that will be necessary for working with data and displaying plots
# To visualize plots in the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io # To read matlab files
import pylab
In this notebook we will make extensive use of probability distributions. In general, we will use capital letters ${\bf X}$, $S$, $E$, ..., to denote random variables, and lower-case letters ${\bf x}$, $s$, $\epsilon$, ..., to denote the values they can take.
In general, we will use the letter $p$ for probability density functions (pdf). When necessary, we will use capital subscripts to make the random variable explicit. For instance, $p_{{\bf X}, S}({\bf x}, s)$ would be the joint pdf of random variables ${\bf X}$ and $S$ at values ${\bf x}$ and $s$, respectively.
However, to avoid notational overload, we will omit subscripts when they are clear from the context. For instance, we will use $p({\bf x}, s)$ instead of $p_{{\bf X}, S}({\bf x}, s)$.
Given an observation vector ${\bf x}$, the goal of the regression problem is to find a function $f({\bf x})$ providing good predictions about some unknown variable $s$. To do so, we assume that a set of labelled training examples, $\{{\bf x}_k, s_k\}_{k=0}^{K-1}$ is available.
The predictor function should make good predictions for new observations ${\bf x}$ not used during training. In practice, this is tested using a second set (the test set) of labelled samples.
Many regression algorithms are grounded on the idea that all samples from the training set have been generated independently by some common stochastic process.
If $p({\bf x}, s)$ were known, we could apply estimation theory to estimate $s$ for a given ${\bf x}$ using $p$. For instance, we could apply any of the following classical estimates:
* The MSE (minimum mean square error) estimate, $\hat{s}_\text{MSE} = \mathbb{E}\{s \mid {\bf x}\}$, which is the mean of the posterior distribution $p(s \mid {\bf x})$
* The MAD (minimum mean absolute deviation) estimate, which is the median of $p(s \mid {\bf x})$
* The MAP (maximum a posteriori) estimate, which is the mode of $p(s \mid {\bf x})$
Note that, since these estimators depend on $p(s |{\bf x})$, knowing the posterior distribution of the target variable is enough, and we do not need to know the joint distribution $p({\bf x}, s)$.
More importantly, note that if we knew the underlying model, we would not need the data in ${\cal D}$ to make predictions on new data.
Assume the target variable $s$ is a scaled noisy version of the input variable $x$: $$ s = 2 x + \epsilon $$ where $\epsilon$ is a Gaussian noise variable with zero mean and unit variance, which does not depend on $x$.
Since the MSE estimate is the conditional mean, which has already been computed, we have $\hat{s}_\text{MSE}= 2x$.
The prediction is $\hat{s}_\text{MSE}= 2 \cdot 4 = 8$
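As a quick numerical sanity check (a minimal sketch, not part of the original exercise), the following cell draws samples from the assumed model $s = 2x + \epsilon$ at $x=4$ and verifies that the empirical mean of $s$ approaches the MSE prediction of 8.
In [ ]:
# Minimal sketch (illustrative): empirically check that E{s | x=4} is close to 8
# under s = 2x + eps, with eps a zero-mean, unit-variance Gaussian independent of x
import numpy as np
np.random.seed(0)                   # for reproducibility
x0 = 4                              # test input
n_samples = 100000                  # number of Monte Carlo draws
s_samples = 2 * x0 + np.random.randn(n_samples)
print("Empirical mean of s given x = 4: {:.4f}".format(np.mean(s_samples)))
print("MSE prediction: {}".format(2 * x0))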
Assume the target variable $s$ is a scaled noisy version of the input variable $x$: $$ s = w x + \epsilon $$ where $\epsilon$ is a Gaussian noise variable with zero mean and unit variance, which does not depend on $x$. Assume that $w$ is known.
We will use the training data to estimate ${\bf w}$.
The estimation of ${\bf w}$ from a given dataset $\mathcal{D}$ is the goal of the following sections.
The ML (Maximum Likelihood) principle is well-known in statistics and can be stated as follows: take the value of the parameter to be estimated (in our case, ${\bf w}$) that best explains the given observations (in our case, the training dataset $\mathcal{D}$). Mathematically, this can be expressed as follows: $$ \hat{\bf w}_{\text{ML}} = \arg \max_{\bf w} p(\mathcal{D}|{\bf w}) $$
Assume that all samples in dataset ${\cal D} = \{(x_k, s_k), k=0,\ldots,K-1 \}$ have been generated according to $$ s_k = w \cdot x_k + \epsilon_k $$ where $\epsilon_k$ are i.i.d. (independent and identically distributed) Gaussian noise random variables with zero mean and unit variance, which do not depend on $x_k$.
Compute the ML estimate, $\hat{w}_{\text{ML}}$, of $w$.
From Exercise 2, $$ p\left(s_k \mid x_k, w \right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac12\left(s_k-wx_k\right)^2\right) $$
Since the noise variables are i.i.d., we have $$ p(\mathcal{D}|w) = p\left(s_0,\ldots, s_{K-1}\mid x_{0}, \ldots, x_{K-1}, w \right) p\left(x_0,\ldots, x_{K-1}\mid w \right)\\ \qquad \quad = \prod_{k=0}^{K-1} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac12\left(s_k-wx_k\right)^2\right) p\left(x_0,\ldots, x_{K-1} \right) \\ \qquad \quad = \frac{1}{(2\pi)^{\frac{K}{2}}} \exp\left(-\frac12 \sum_{k=0}^{K-1} \left(s_k-wx_k\right)^2\right) p\left(x_0,\ldots, x_{K-1} \right) $$ where we have used the fact that the inputs do not depend on $w$, so that $p\left(x_0,\ldots, x_{K-1}\mid w \right) = p\left(x_0,\ldots, x_{K-1}\right)$. Therefore: $$ \hat{w}_{\text{ML}} = \arg \max_{w} p(\mathcal{D}|w) = \arg \min_{w} \sum_{k=0}^{K-1} \left(s_k-wx_k\right)^2 $$
Differentiating with respect to $w$ and setting the derivative to zero, $$ -2\sum_{k=0}^{K-1} x_k\left(s_k - \hat{w}_{\text{ML}}\, x_k\right) = 0 $$ which yields $$ \hat{w}_{\text{ML}} = \frac{\sum_{k=0}^{K-1} x_k s_k}{\sum_{k=0}^{K-1} x_k^2} $$
In [4]:
X = np.array([0.15, 0.41, 0.53, 0.80, 0.89, 0.92, 0.95])
s = np.array([0.09, 0.16, 0.63, 0.44, 0.55, 0.82, 0.95])
In [5]:
# <SOL>
plt.figure()
plt.scatter(X, s)
plt.xlabel('x')
plt.ylabel('s')
plt.show()
# </SOL>
In [7]:
# wML = <FILL IN>
wML = np.sum(X*s) / np.sum(X*X)
print("The ML estimate is {}".format(wML))
In [8]:
sigma_eps = 1
K = len(s)
wGrid = np.arange(-0.5, 2, 0.01)
p = []
for w in wGrid:
    d = s - X*w
    # p.append(<FILL IN>)
    p.append((1.0/(np.sqrt(2*np.pi)*sigma_eps))**K * np.exp(-np.dot(d, d) / (2*sigma_eps**2)))
# Compute the likelihood for the ML parameter wML
# d = <FILL IN>
d = s-X*wML
# pML = [<FILL IN>]
pML = [(1.0/(np.sqrt(2*np.pi)*sigma_eps))**K * np.exp(-np.dot(d, d) / (2*sigma_eps**2))]
# Plot the likelihood function and the optimal value
plt.figure()
plt.plot(wGrid, p)
plt.stem([wML], pML)
plt.xlabel('$w$')
plt.ylabel('Likelihood function')
plt.show()
In [9]:
xgrid = np.arange(0, 1.2, 0.01)
# sML = <FILL IN>
sML = wML * xgrid
plt.figure()
plt.scatter(X, s)
# plt.plot(<FILL IN>)
plt.plot(xgrid, sML)
plt.xlabel('x')
plt.ylabel('s')
plt.axis('tight')
plt.show()
In order to solve exercise 4 we have taken advantage of the statistical independence of the noise components. Some independence assumptions are required in general to compute the ML estimate in other scenarios.
In order to estimate ${\bf w}$ from the training data in a mathematically rigorous and compact form, let us group the target variables into a vector $$ {\bf s} = \left(s_0, \dots, s_{K-1}\right)^\top $$ and the input vectors into a matrix $$ {\bf X} = \left({\bf x}_0, \dots, {\bf x}_{K-1}\right)^\top $$
We will make the following assumptions:
* A1. All samples in ${\cal D}$ have been generated independently by the same stochastic process.
* A2. The input variables do not depend on ${\bf w}$, so that $p({\bf X}|{\bf w}) = p({\bf X})$.
* A3. Given the inputs and ${\bf w}$, the targets are statistically independent: $p({\bf s}|{\bf X},{\bf w}) = \prod_{k=0}^{K-1} p(s_k \mid {\bf x}_k, {\bf w})$.
Since ${\cal D} = ({\bf X}, {\bf s})$, we can write $$p(\mathcal{D}|{\bf w}) = p({\bf s}, {\bf X}|{\bf w}) = p({\bf s} | {\bf X}, {\bf w}) p({\bf X}|{\bf w}) $$ Using assumption A2, $$ p(\mathcal{D}|{\bf w}) = p({\bf s} | {\bf X}, {\bf w}) p({\bf X}) $$
and, finally, using assumption A3, we can express the estimation problem as the computation of
\begin{align} \hat{\bf w}_{\text{ML}} &= \arg \max_{\bf w} p({\bf s}|{\bf X},{\bf w}) \\ &= \arg \max_{\bf w} \prod_{k=0}^{K-1} p(s_k \mid {\bf x}_k, {\bf w}) \\ &= \arg \max_{\bf w} \sum_{k=0}^{K-1}\log p(s_k \mid {\bf x}_k, {\bf w}) \end{align}Any of the last three terms can be used to optimize ${\bf w}$. The sum in the last term is usually called the log-likelihood function, $L({\bf w})$, whereas the product in the previous line is simply referred to as the likelihood function.
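As a side note (a minimal sketch, not in the original text), working with the log-likelihood is also convenient numerically: the product of many small density values quickly underflows to zero, while the sum of their logarithms remains well behaved. The cell below illustrates this with an assumed toy Gaussian model with unit noise variance.
In [ ]:
# Minimal sketch (assumed toy setup): likelihood vs. log-likelihood for many samples
import numpy as np

np.random.seed(0)
K_toy = 2000
x_toy = np.random.rand(K_toy)
s_toy = 2 * x_toy + np.random.randn(K_toy)       # s = 2x + eps, unit-variance noise
w_test = 2.0

# Gaussian densities p(s_k | x_k, w), k = 0, ..., K_toy - 1
dens = np.exp(-0.5 * (s_toy - w_test * x_toy)**2) / np.sqrt(2 * np.pi)

print("Likelihood (product of densities):", np.prod(dens))            # underflows to 0.0
print("Log-likelihood (sum of log-densities):", np.sum(np.log(dens)))  # finite value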
Let's summarize what we need to do in order to design a regression algorithm based on ML estimation:
* Assume a parametric data model $p(s\mid{\bf x}, {\bf w})$.
* Using the data model and the independence assumptions, compute the likelihood (or the log-likelihood) of ${\bf w}$ given the training data.
* Maximize the (log-)likelihood to obtain $\hat{\bf w}_\text{ML}$.
* Use the estimated model $p(s\mid{\bf x}, \hat{\bf w}_\text{ML})$ to make predictions for new inputs ${\bf x}$.
Let us assume that the target variables $s_k$ in dataset $\mathcal{D}$ are given by $$ s_k = {\bf w}^\top {\bf z}_k + \varepsilon_k $$
where ${\bf z}_k$ is the result of some transformation of the inputs, ${\bf z}_k = T({\bf x}_k)$, and $\varepsilon_k$ are i.i.d. instances of a Gaussian random variable with mean zero and variance $\sigma_\varepsilon^2$, i.e., $$ p_E(\varepsilon) = \frac{1}{\sqrt{2\pi}\sigma_\varepsilon} \exp\left(-\frac{\varepsilon^2}{2\sigma_\varepsilon^2}\right) $$
Assuming that the noise variables are independent of ${\bf x}$ and ${\bf w}$, for given ${\bf x}$ and ${\bf w}$ the target variable is Gaussian with mean ${\bf w}^\top {\bf z}$ and variance $\sigma_\varepsilon^2$: $$ p(s|{\bf x}, {\bf w}) = p_E(s-{\bf w}^\top{\bf z}) = \frac{1}{\sqrt{2\pi}\sigma_\varepsilon} \exp\left(-\frac{(s-{\bf w}^\top{\bf z})^2}{2\sigma_\varepsilon^2}\right) $$
Now we need to compute the likelihood function $p({\bf s} | {\bf X}, {\bf w})$. Using assumption A3 (conditional independence of the targets), we can write $$ p({\bf s}| {\bf X}, {\bf w}) = \prod_{k=0}^{K-1} p(s_k| {\bf x}_k, {\bf w}) = \prod_{k=0}^{K-1} \frac{1}{\sqrt{2\pi}\sigma_\varepsilon} \exp\left(-\frac{\left(s_k-{\bf w}^\top{\bf z}_k\right)^2}{2\sigma_\varepsilon^2}\right) \\ = \left(\frac{1}{\sqrt{2\pi}\sigma_\varepsilon}\right)^K \exp\left(-\sum_{k=0}^{K-1} \frac{\left(s_k-{\bf w}^\top{\bf z}_k\right)^2}{2\sigma_\varepsilon^2}\right) $$
Finally, grouping variables ${\bf z}_k$ in $${\bf Z} = \left({\bf z}_0, \dots, {\bf z}_{K-1}\right)^\top$$ we get $$ p({\bf s}| {\bf X}, {\bf w}) = \left(\frac{1}{\sqrt{2\pi}\sigma_\varepsilon}\right)^K \exp\left(-\frac{1}{2\sigma_\varepsilon^2}\|{\bf s}-{\bf Z}{\bf w}\|^2\right) $$
The maximum likelihood solution is then given by: $$ {\bf w}_\text{ML} = \arg \max_{\bf w} p({\bf s}|{\bf X}, {\bf w}) = \arg \min_{\bf w} \|{\bf s} - {\bf Z}{\bf w}\|^2 $$ Note that $\|{\bf s} - {\bf Z}{\bf w}\|^2$ is the sum of the squared prediction errors (Sum of Squared Errors, SSE) for all samples in the dataset. This is also called the Least Squares (LS) solution.
The LS solution can be easily computed by differentiation, $$ \nabla_{\bf w} \|{\bf s} - {\bf Z}{\bf w}\|^2\Bigg|_{{\bf w} = {\bf w}_\text{ML}} = - 2 {\bf Z}^\top{\bf s} + 2 {\bf Z}^\top{\bf Z} {\bf w}_{\text{ML}} = {\bf 0} $$ and it is equal to $$ {\bf w}_\text{ML} = ({\bf Z}^\top{\bf Z})^{-1}{\bf Z}^\top{\bf s} $$
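To make the closed-form expression concrete, here is a minimal sketch (using synthetic data generated only for illustration, and an assumed quadratic transformation ${\bf z} = (1, x, x^2)^\top$) checking that the normal-equations solution $({\bf Z}^\top{\bf Z})^{-1}{\bf Z}^\top{\bf s}$ coincides with the output of NumPy's least-squares solver.
In [ ]:
# Minimal sketch (synthetic data, for illustration only):
# closed-form LS solution vs. np.linalg.lstsq
import numpy as np

np.random.seed(0)
K_toy = 50
x_toy = np.random.rand(K_toy)
Z_toy = np.vstack((np.ones(K_toy), x_toy, x_toy**2)).T    # assumed z = (1, x, x^2)
w_true = np.array([0.5, -1.0, 2.0])                       # arbitrary true weights
s_toy = Z_toy.dot(w_true) + 0.1 * np.random.randn(K_toy)  # noisy targets

# Normal equations: w_ML = (Z^T Z)^{-1} Z^T s
w_normal_eq = np.linalg.solve(Z_toy.T.dot(Z_toy), Z_toy.T.dot(s_toy))
# NumPy least-squares solver
w_lstsq = np.linalg.lstsq(Z_toy, s_toy, rcond=None)[0]

print(w_normal_eq)
print(w_lstsq)    # both should agree up to numerical precision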
The last step consists of computing an estimate of $s$ by assuming that the true value of the weight parameters is ${\bf w}_\text{ML}$. In particular, the minimum MSE estimate is $$ \hat{s}_\text{MSE} = \mathbb{E}\{s|{\bf x},{\bf w}_\text{ML}\} $$
Knowing that, given ${\bf x}$ and ${\bf w}$, $s$ is normally distributed with mean ${\bf w}^\top {\bf z}$ we can write $$ \hat{s}_\text{MSE} = {\bf w}_\text{ML}^\top {\bf z} $$
In [16]:
X = np.array([0.15, 0.41, 0.53, 0.80, 0.89, 0.92, 0.95])
s = np.array([0.09, 0.16, 0.63, 0.44, 0.55, 0.82, 0.95])
Assume that the data $X$ and $s$ defined above have been generated by the polynomial Gaussian model $$ s = w_0 + w_1 x + w_2 x^2 + \epsilon $$ (i.e., with ${\bf z} = T(x) = (1, x, x^2)^\intercal$), where $\epsilon$ is zero-mean Gaussian noise with standard deviation $\sigma_\epsilon$ given in the following cell.
In [17]:
sigma_eps = 0.3
In [18]:
# Compute the extended input matrix Z
nx = len(X)
# Z = <FILL IN>
Z = np.hstack((np.ones((nx, 1)), X[:,np.newaxis], X[:,np.newaxis]**2))
# Compute the ML estimate using linalg.lstsq from Numpy.
# wML = <FILL IN>
wML = np.linalg.lstsq(Z, s, rcond=None)[0]
print(wML)
In [19]:
K = len(s)
# Compute the log-likelihood for the ML parameter wML
# d = <FILL IN>
d = s - np.dot(Z, wML)
# LwML = [<FILL IN>]
LwML = - K/2*np.log(2*np.pi*sigma_eps**2) - np.dot(d, d) / (2*sigma_eps**2)
print(LwML)
In [20]:
xgrid = np.arange(0, 1.2, 0.01)
nx = len(xgrid)
# Compute the input matrix for the grid data in x
# Z = <FILL IN>
Z = np.hstack((np.ones((nx, 1)), xgrid[:,np.newaxis], xgrid[:,np.newaxis]**2))
# sML = <FILL IN>
sML = np.dot(Z, wML)
plt.figure()
plt.scatter(X, s)
# plt.plot(<FILL IN>)
plt.plot(xgrid, sML)
plt.xlabel('x')
plt.ylabel('s')
plt.axis('tight')
plt.show()
Solution:
In [22]:
K = len(s)
wGrid = np.arange(0, 6, 0.01)
p = []
Px = np.prod(X)
xs = np.dot(X,s)
for w in wGrid:
    # p.append(<FILL IN>)
    p.append((w**K)*Px*np.exp(-w*xs))
plt.figure()
# plt.plot(<FILL IN>)
plt.plot(wGrid, p)
plt.xlabel('$w$')
plt.ylabel('Likelihood function')
plt.show()
(Hint: you can maximize the log-likelihood function instead of the likelihood function in order to simplify the differentiation)
Solution:
In [24]:
# wML = <FILL IN>
wML = float(K) / xs
print(wML)
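For reference, here is a short sketch of the derivation behind the value computed above, assuming the model of this exercise, $p(s \mid x, w) = w x \exp(-w x s)$, and i.i.d. samples (following the hint, we maximize the log-likelihood): $$ L(w) = \log \prod_{k=0}^{K-1} w x_k \exp(-w x_k s_k) = K \log w + \sum_{k=0}^{K-1} \log x_k - w \sum_{k=0}^{K-1} x_k s_k $$ Differentiating and setting the derivative to zero, $$ \frac{dL(w)}{dw} = \frac{K}{w} - \sum_{k=0}^{K-1} x_k s_k = 0 \quad\Longrightarrow\quad \hat{w}_{\text{ML}} = \frac{K}{\sum_{k=0}^{K-1} x_k s_k} $$ which is exactly the quantity computed in the cell above (with $\sum_k x_k s_k$ stored in the variable xs).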
Solution:
In [25]:
xgrid = np.arange(0.1, 1.2, 0.01)
# sML = <FILL IN>
sML = 1 / (wML * xgrid)
plt.figure()
plt.scatter(X, s)
# plt.plot(<FILL IN>)
plt.plot(xgrid, sML)
plt.xlabel('x')
plt.ylabel('s')
plt.axis('tight')
plt.show()
Subjectively, we can see that the predictor computed in exercise 6 does not fit the given data very well. This could be a false perception: if the data have been truly generated by the parametric model assumed in exercise 6 (i.e., $p(s \mid x, w) = w x \exp(- w x s)$), the apparent misbehavior of the estimator could be caused by the natural randomness of the data, and a larger amount of data would show a better fit.
Alternatively, it may be the case that the model assumed in exercise 6 is incorrect. Again, more data would be useful to assess this.
This shows that the choice of the data model is important. In many applications, no parametric data model is available, and the data scientist must make a choice based on the nature of the data or any previous knowledge about the statistical behavior of the data.
If no previous information is available, the data scientist can try different models and compare them using validation data or some cross-validation technique.
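As an illustrative sketch of this idea (not part of the original notebook), the cell below compares the two parametric models used above, the linear Gaussian model of exercise 4 and the exponential model of exercise 6, by leave-one-out cross-validation on the toy dataset, scoring each model by the log-density it assigns to the held-out sample. The scoring rule and the unit noise variance assumed for the Gaussian model are choices made only for this sketch.
In [ ]:
# Minimal sketch (illustrative assumptions): leave-one-out cross-validation comparing
# the linear Gaussian model (s = w x + eps, unit-variance noise) and the exponential
# model (p(s|x,w) = w x exp(-w x s)) by the log-density of the held-out sample.
import numpy as np

X = np.array([0.15, 0.41, 0.53, 0.80, 0.89, 0.92, 0.95])
s = np.array([0.09, 0.16, 0.63, 0.44, 0.55, 0.82, 0.95])
K = len(s)

loglik_gauss, loglik_exp = 0.0, 0.0
for k in range(K):
    # Split: sample k is held out, the rest is used as training data
    mask = np.arange(K) != k
    X_tr, s_tr = X[mask], s[mask]
    x_val, s_val = X[k], s[k]

    # ML estimates on the training fold (closed forms derived above)
    w_gauss = np.sum(X_tr * s_tr) / np.sum(X_tr * X_tr)
    w_exp = len(s_tr) / np.dot(X_tr, s_tr)

    # Log-density of the held-out sample under each estimated model
    loglik_gauss += -0.5 * np.log(2 * np.pi) - 0.5 * (s_val - w_gauss * x_val)**2
    loglik_exp += np.log(w_exp * x_val) - w_exp * x_val * s_val

print("Held-out log-likelihood, Gaussian model:    {:.3f}".format(loglik_gauss))
print("Held-out log-likelihood, exponential model: {:.3f}".format(loglik_exp))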