Suppose we have a set of observed data $D$ and we want to evaluate a parameter setting $w$: $$p(w|D) = \frac{p(D|w)p(w)}{p(D)}$$ $$p(D) = \sum_{w}{p(D|w)p(w)}$$
We call $p(D|w)$ the likelihood function. Then we have $p(w|D) \propto p(D|w)p(w)$. If $p(w)$ is the same for all $w$, we can simply choose $w$ to maximize the likelihood $p(D|w)$, which is equivalent to maximizing the log-likelihood $\log{p(D|w)}$.
We have observed data $x_1, \cdots, x_n$ drawn independently from a Bernoulli distribution: $$p(x) = \begin{cases} \theta & \quad \text{if } x = 1\\ 1 - \theta & \quad \text{if } x = 0\\ \end{cases}$$
(a) What is the likelihood function based on $\theta$?
(b) What is the log-likelihood function?
(c) Compute estimated $\theta$ to maximize the log-likelihood function.
Answers:
(a) $$ \begin{array}{ll} L(\theta) &= p(D|\theta) = p(x_1, \dots, x_n|\theta) \\ &= \prod_{i}{p(x_i|\theta)} = \theta^{\sum{\mathbb{1}(x_i = 1)}}(1 - \theta)^{\sum{\mathbb{1}(x_i = 0)}} \\ &= \theta^{k}(1 - \theta)^{n - k} \end{array} $$ where $k$ is the number of $1$s in the observed data.
(b) $$\log{L(\theta)} = k\log(\theta) + (n - k)\log(1 - \theta)$$
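(c) A short sketch of the remaining part: setting the derivative of the log-likelihood to zero, $$ \frac{d}{d\theta}\log{L(\theta)} = \frac{k}{\theta} - \frac{n - k}{1 - \theta} = 0 \quad\Longrightarrow\quad k(1 - \theta) = (n - k)\theta \quad\Longrightarrow\quad \hat{\theta}_{ML} = \frac{k}{n}, $$ i.e. the maximum-likelihood estimate is simply the fraction of $1$s in the observed data.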
We have observed data $x_1, \cdots, x_n$ drawn independently from a Normal distribution: $$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^\frac{1}{2}} \exp{(-\frac{1}{2\sigma^2}(x - \mu)^2)}$$
(a) What is the likelihood function based on $\mu$ and $\sigma^2$?
(b) What is the log-likelihood function?
(c) Compute estimated parameters $\mu$ and $\sigma^2$ to maximize the log-likelihood function.
Answers:
(a) $$ L(\mu, \sigma^2) = \prod_{i=1}^{n}{\mathcal{N}(x_i|\mu, \sigma^2)} = \frac{1}{(2\pi \sigma^2)^\frac{n}{2}} \exp{\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}{(x_i - \mu)^2}\right)} $$
(b) $$ \log{L(\mu, \sigma^2)} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}{(x_i - \mu)^2} $$
(c) Setting $\frac{\partial \log{L}}{\partial \mu} = 0$ and $\frac{\partial \log{L}}{\partial \sigma^2} = 0$ gives $$ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}{x_i}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}{(x_i - \hat{\mu})^2} $$
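A quick numerical sanity check of these closed-form estimates (a minimal numpy sketch; the true parameters and sample size below are arbitrary assumptions):

```python
# Check the Gaussian MLE formulas: sample mean and (biased) sample variance.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 2.0, 1.5, 100_000
x = rng.normal(mu_true, sigma_true, size=n)

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of sigma^2 (divides by n, not n-1)

print(mu_hat, sigma2_hat)  # should be close to 2.0 and 1.5**2 = 2.25
```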
Recall that the objective function we minimized in the last lecture is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$
To penalize large coefficients, we add a penalization/regularization term and minimize the two together. $$ E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})} $$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.
$\lambda$ is the regularization coefficient.
Exercise: Derive the gradient element-wise to verify the above result, i.e. use $\phi_d(\vec{x}_n)$ and $w_d$ to represent $E(w_1, w_2, \dots, w_D)$ and derive $\frac{\partial E}{\partial w_d}$. Suppose $\phi(\vec{x}_n) \in \mathbb{R}^D$.
Based on what we derived in the last lecture, we can write the objective function as $$ \begin{aligned} E(\vec{w}) &= \frac{1}{2}\sum_{n = 1}^{N}{\left(\sum_{d=1}^{D}{w_d\phi_d(\vec{x}_n)} - t_n\right)^2} + \frac{\lambda}{2}\sum_{d=1}^{D}{w_d^2} \\ \frac{\partial E}{\partial w_d} &= \sum_{n = 1}^{N}{\phi_d(\vec{x}_n)\left(\sum_{d'=1}^{D}{w_{d'}\phi_{d'}(\vec{x}_n)} - t_n\right)} + \lambda w_d \\ &= \sum_{n = 1}^{N}{\phi_d(\vec{x}_n)(\vec{w}^T\phi(\vec{x}_n) - t_n)} + \lambda w_d \end{aligned} $$
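As a sanity check, this element-wise gradient can be compared against finite differences (a minimal numpy sketch; the sizes, $\lambda$, and random data are arbitrary assumptions, and the rows of `Phi` play the role of $\phi(\vec{x}_n)^T$):

```python
# Finite-difference check of dE/dw_d for the regularized sum-of-squares objective.
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 4, 0.3
Phi = rng.normal(size=(N, D))   # row n is phi(x_n)^T
t = rng.normal(size=N)
w = rng.normal(size=D)

def E(w):
    r = Phi @ w - t
    return 0.5 * r @ r + 0.5 * lam * w @ w

analytic = Phi.T @ (Phi @ w - t) + lam * w   # the gradient derived above
eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(D)[d]) - E(w - eps * np.eye(D)[d])) / (2 * eps)
                    for d in range(D)])
print(np.max(np.abs(analytic - numeric)))    # should be ~1e-8 or smaller
```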
The gradient is $$ \begin{aligned} \nabla_{\vec{w}} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{aligned} $$
Setting the gradient to 0, we will get the solution $$ \boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} } $$
The solution to ordinary least squares is $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, but there we cannot guarantee that $\Phi^T \Phi$ is invertible. In regularized least squares, by contrast, $\Phi^T \Phi + \lambda I$ is always invertible when $\lambda > 0$.
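A minimal numerical sketch of this closed-form solution (numpy only; the sizes, $\lambda$, and random data are arbitrary assumptions):

```python
# Regularized least squares (ridge) via the boxed closed form,
# plus a check that the gradient vanishes at the solution.
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 100, 5, 0.1
Phi = rng.normal(size=(N, D))
t = rng.normal(size=N)

w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

grad = (Phi.T @ Phi + lam * np.eye(D)) @ w_hat - Phi.T @ t
print(np.max(np.abs(grad)))   # ~0: w_hat satisfies the stationarity condition
```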
Exercise: To be invertible, a matrix needs to be full rank. Argue that $\Phi^T \Phi + \lambda I$ is full rank by characterizing its $D$ eigenvalues in terms of the singular values of $\Phi$ and $\lambda$.
Let $\Phi = U^T\Lambda V$ be the SVD of $\Phi$, with $U$ and $V$ orthogonal and $\Lambda$ diagonal; then $\Phi^T\Phi = V^T\Lambda^2V$.
Then we have $(\Phi^T\Phi + \lambda I)V^T = V^T(\Lambda^2 + \lambda I)$.
The $i^{th}$ eigenvalue of $\Phi^T\Phi + \lambda I$ is therefore $\lambda_i^2 + \lambda > 0$, where $\lambda_i$ is the $i^{th}$ singular value of $\Phi$ and $\lambda > 0$.
Then $\det{(\Phi^T\Phi + \lambda I)} = \prod_i{(\lambda_i^2 + \lambda)} > 0$, which means $\Phi^T\Phi + \lambda I$ is invertible.
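The same eigenvalue claim can be checked numerically (again a sketch with arbitrary random data):

```python
# Eigenvalues of Phi^T Phi + lambda*I should equal sigma_i^2 + lambda,
# where sigma_i are the singular values of Phi.
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 100, 5, 0.1
Phi = rng.normal(size=(N, D))

sing = np.linalg.svd(Phi, compute_uv=False)              # singular values of Phi
eig = np.linalg.eigvalsh(Phi.T @ Phi + lam * np.eye(D))  # eigenvalues, ascending
print(np.allclose(np.sort(sing**2 + lam), eig))          # True
```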
The $\ell^p$ norm of a vector $\vec{x}$ is defined as $$ \left \| \vec{x} \right \|_p = (\sum_{j=1}^{M} |x_j|^p)^\frac{1}{p} $$
For the regularized least squares above, we used the $\ell^2$ norm. We could also use other $\ell^p$ norms as regularizers, in which case the objective function becomes $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p $$
Exercise: Derive the element-wise gradient for the above $\ell^p$ norm regularized energy function.
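A sketch of the element-wise gradient, assuming $p > 1$ so that $|w_d|^p$ is differentiable everywhere (for $p = 1$ the regularizer is not differentiable at $w_d = 0$): $$ \frac{\partial E}{\partial w_d} = \sum_{n=1}^{N}{\phi_d(\vec{x}_n)\left(\vec{w}^T\phi(\vec{x}_n) - t_n\right)} + \frac{\lambda p}{2}\,|w_d|^{p-1}\,\mathrm{sign}(w_d) $$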
Gaussian Distribution $$ \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] $$
Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)
We assume the signal-plus-noise model for a single data point $(\vec{x}, t)$ is $$ \begin{gathered} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gathered} $$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the perturbation/randomness.
Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$ t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1}) $$
Exercise: What is the likelihood function of $t$ given $\vec{x}$, $\vec{w}$, and $\beta$?
The likelihood function of $t$ is just the probability density function (PDF) of $t$: $$ p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1}) $$
For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$ p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$
Notation Clarification
Single data likelihood is $$ p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \} $$
Single data log-likelihood is $$ \ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 $$ We use the logarithm because the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$, and the logarithm converts products into sums, which makes life easier.
Complete data log-likelihood is $$ \begin{aligned} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{aligned} $$
Maximum likelihood estimate $\vec{w}_{ML}$ is $$ \begin{aligned} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{aligned} $$
Familiar? Recall that the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we can conclude that $$ \boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}} $$ where $\Phi^\dagger = (\Phi^T \Phi)^{-1} \Phi^T$ is the pseudo-inverse of $\Phi$.
Exercise: Derive the MAP Estimator of $\vec{w}$ and compare the solution with regularized linear regression. What is $\lambda$ in this case?
$\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian, which has PDF $$ p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^D} \exp \left \{ -\frac{1}{2 \alpha^{-1}} \sum_{d=1}^D w_d^2 \right \} $$ where $D$ is the dimension of $\vec{w}$.
So the MAP estimator is $$ \begin{aligned} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 + \frac{\alpha}{2} \sum_{d=1}^D w_d^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(x_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{aligned} $$
Exactly the objective in regularized least squares! So $$ \boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} } $$
Compared with the $\ell^2$-norm regularized least squares, we have $\lambda = \frac{\alpha}{\beta}$.
Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, where $$ \begin{gather} t_i = \vec{w}^T \vec{x_i} + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\epsilon$ is the "noise term".
(a) Quick quiz: what is the likelihood given $\vec{w}$? That is, what's $p(t_i | \vec{x}_i, \vec{w})$?
Answer: $p(t_i | \vec{x}_i, \vec{w}) = \mathcal{N}(t_i|\vec{w}^\top \vec{x}_i, \beta^{-1}) = \frac{1}{(2\pi \beta^{-1})^\frac{1}{2}} \exp{(-\frac{\beta}{2}(t_i - \vec{w}^\top \vec{x}_i)^2)}$
Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, sampled IID. We will also put a prior on $\vec{w}$ with a positive semi-definite (PSD) covariance matrix $\Sigma$. $$ \begin{gather} t_i = \vec{w}^T \vec{x}_i + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \\ \vec{w} \sim \mathcal{N}(0, \Sigma) \end{gather} $$ Note: the difference here is that our prior is a multivariate Gaussian with non-identity covariance! We also let $\mathcal{X} = \{\vec{x}_1, \cdots, \vec{x}_n\}$.
(a) Compute the log posterior function, $\log p(\vec{w}|\vec{t}, \mathcal{X},\beta)$
Hint: use Bayes' Rule
(b) Compute the MAP estimate of $\vec{w}$ for this model
Hint: the solution is very similar to the MAP estimate for a gaussian prior with identity covariance
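A sketch of the derivation, assuming $\Sigma$ is invertible and writing $X$ for the matrix whose $i^{th}$ row is $\vec{x}_i^T$ (notation introduced here for compactness). By Bayes' rule, $p(\vec{w}|\vec{t},\mathcal{X},\beta) \propto p(\vec{t}|\mathcal{X},\vec{w},\beta)\, p(\vec{w})$, so $$ \log p(\vec{w}|\vec{t},\mathcal{X},\beta) = -\frac{\beta}{2}\sum_{i=1}^{n}{(t_i - \vec{w}^T\vec{x}_i)^2} - \frac{1}{2}\vec{w}^T\Sigma^{-1}\vec{w} + \text{const.} $$ Setting the gradient with respect to $\vec{w}$ to zero gives $$ \vec{w}_{MAP} = \left(\beta X^T X + \Sigma^{-1}\right)^{-1} \beta X^T \vec{t}, $$ which reduces to the earlier result when $\Sigma = \alpha^{-1}I$: then $\Sigma^{-1} = \alpha I$, and dividing through by $\beta$ recovers the ridge solution with $\lambda = \frac{\alpha}{\beta}$.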