$$ \text{LaTeX command declarations here.} \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} $$

EECS 445: Machine Learning

Hands On 05: Linear Regression II

  • Instructors: Zhao Fu, Valli, Jacob Abernethy, and Jia Deng
  • Date: September 26, 2016

Review: Maximum Likelihood

Suppose we have a set of observed data $D$ and we want to evaluate a parameter setting $w$: $$p(w|D) = \frac{p(D|w)p(w)}{p(D)}$$ $$p(D) = \sum_{w}{p(D|w)p(w)}$$

We call $p(D|w)$ the likelihood function. Then $p(w|D) \propto p(D|w)p(w)$. If the prior $p(w)$ is the same for all $w$, we need only choose $w$ to maximize the likelihood $p(D|w)$, which is equivalent to maximizing the log-likelihood $\log{p(D|w)}$.

Review Problem: Maximum Likelihood Estimation

We have observed data $x_1, \cdots, x_n$ drawn from Bernoulli distribution: $$p(x) = \begin{cases} \theta & \quad \text{if } x = 1\\ 1 - \theta & \quad \text{if } x = 0\\ \end{cases}$$

(a) What is the likelihood function based on $\theta$?

(b) What is the log-likelihood function?

(c) Compute estimated $\theta$ to maximize the log-likelihood function.

Solution 1: Maximum Likelihood Estimation

(a) $$ \begin{array}{ll} L(\theta) &= p(D|\theta) = p(x_1, \dots, x_n|\theta) \\ &= \prod_{i}{p(x_i|\theta)} = \theta^{\sum{\mathbb{1}(x_i = 1)}}(1 - \theta)^{\sum{\mathbb{1}(x_i = 0)}} \\ &= \theta^{k}(1 - \theta)^{n - k} \end{array} $$ where $k$ is the number of $1$s in the observed data.

(b) $$\log{L(\theta)} = k\log(\theta) + (n - k)\log(1 - \theta)$$

Solution 1: Maximum Likelihood Estimation

(c) Setting the derivative of $\log L(\theta)$ to zero, we have $$ \frac{\mathrm{d}\log L(\theta)}{\mathrm{d}\theta} = \frac{k}{\theta} - \frac{n - k}{1 - \theta} = 0 \\ \frac{k}{\theta} = \frac{n - k}{1 - \theta} \\ \theta = \frac{k}{n} $$
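
As a quick numerical sanity check (not part of the original exercise), the short NumPy sketch below evaluates the Bernoulli log-likelihood on a grid and confirms that the maximizer matches the closed form $\theta = k/n$; the sample data are made up for illustration.

```python
import numpy as np

# Hypothetical Bernoulli observations, made up for illustration.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, k = len(x), int(x.sum())

def log_likelihood(theta):
    """log L(theta) = k*log(theta) + (n - k)*log(1 - theta)."""
    return k * np.log(theta) + (n - k) * np.log(1 - theta)

# Maximize over a fine grid and compare with the closed-form MLE k/n.
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
theta_grid = grid[np.argmax(log_likelihood(grid))]
print(theta_grid, k / n)   # both approximately 0.7 for this sample
```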

Problem 2: Maximum Likelihood Estimation for Gaussian Distribution

We have observed data $x_1, \cdots, x_n$ drawn from Normal distribution: $$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^\frac{1}{2}} \exp{(-\frac{1}{2\sigma^2}(x - \mu)^2)}$$

(a) What is the likelihood function based on $\mu$ and $\sigma^2$?

(b) What is the log-likelihood function?

(c) Compute estimated parameters $\mu$ and $\sigma^2$ to maximize the log-likelihood function.

Solution 2

We have observed data $x_1, \cdots, x_n$ drawn from Normal distribution: $$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^\frac{1}{2}} \exp{(-\frac{1}{2\sigma^2}(x - \mu)^2)}$$

(a) The likelihood function is $$L(\mu, \sigma^2) = \prod_{i=1}^{n}\mathcal{N}(x_i|\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^{\frac{n}{2}}} \exp{\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)}$$

(b) What is the log-likelihood function?

Answer: $\log{L(\mu, \sigma^2)} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$

(c) Compute estimated parameters $\mu$ and $\sigma^2$ to maximize the log-likelihood function.

Answer:

  • $\mu_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i$
  • $\sigma^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_{\text{MLE}})^2$
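
As a hedged sanity check (not part of the original hands-on), the sketch below draws synthetic Gaussian data and verifies that the closed-form MLEs above are just the sample mean and the (biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from N(mu=2, sigma=1.5), purely for illustration.
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_mle = x.mean()                         # (1/n) sum_i x_i
sigma2_mle = np.mean((x - mu_mle) ** 2)   # (1/n) sum_i (x_i - mu_mle)^2

print(mu_mle, sigma2_mle)   # close to 2.0 and 2.25 for a large sample
```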

Regularized Linear Regression

Regularized Least Squares: Objective Function

  • Recall that the objective function we minimized in the last lecture is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$

  • To penalize large coefficients, we add a penalization/regularization term and minimize the two together. $$ E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})} $$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.

  • $\lambda$ is the regularization coefficient.

  • If $\lambda$ is large, $E_W(\vec{w})$ dominates the objective function. As a result, minimizing $E_W(\vec{w})$ takes priority: the resulting solution $\vec{w}$ tends to have a smaller norm, and the $E_D(\vec{w})$ term will be larger.

Regularized Least Squares: Derivation

  • Based on what we derived in the last lecture, we can write the objective function and its gradient as $$ \begin{aligned} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ \nabla_{\vec{w}} E(\vec{w}) &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{aligned} $$

Exercise: Derive the gradient element-wise to verify the above result, i.e., use $\phi_d(\vec{x}_n)$ and $w_d$ to write $E(w_1, w_2, \dots, w_D)$ and derive $\frac{\partial E}{\partial w_d}$. Suppose $\phi(\vec{x}_n) \in \mathbb{R}^D$.

Regularized Least Squares: Solution

  • Based on what we derived in the last lecture, we can write the objective function as $$ \begin{aligned} E(\vec{w}) &= \frac{1}{2}\sum_{n = 1}^{N}{\left(\sum_{j=1}^{D}{w_j\phi_j(\vec{x}_n)} - t_n\right)^2} + \frac{\lambda}{2}\sum_{j=1}^{D}{w_j^2} \\ \frac{\partial E}{\partial w_d} &= \sum_{n = 1}^{N}{\phi_d(\vec{x}_n)\left(\sum_{j=1}^{D}{w_j\phi_j(\vec{x}_n)} - t_n\right)} + \lambda w_d \\ &= \sum_{n = 1}^{N}{\phi_d(\vec{x}_n)(\vec{w}^T\phi(\vec{x}_n) - t_n)} + \lambda w_d \end{aligned} $$

  • The gradient is $$ \begin{aligned} \nabla_{\vec{w}} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{aligned} $$

  • Setting the gradient to 0, we will get the solution $$ \boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} } $$
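
A minimal NumPy sketch of this closed-form solution follows; it is not from the original notes, and the design matrix, targets, and $\lambda$ value are made up for illustration.

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """Regularized least squares: w = (Phi^T Phi + lam*I)^{-1} Phi^T t."""
    D = Phi.shape[1]
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# Toy polynomial regression problem (data made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)
Phi = np.vander(x, N=6, increasing=True)   # features 1, x, ..., x^5

print(ridge_solution(Phi, t, lam=1e-2))
```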

Regularized Least Squares: Closed Form

In the ordinary least squares solution, $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, we cannot guarantee that $\Phi^T \Phi$ is invertible. But in regularized least squares, if $\lambda > 0$, then $\Phi^T \Phi + \lambda I$ is always invertible.

Exercise: To be invertible, a matrix needs to be full rank. Argue that $\Phi^T \Phi + \lambda I$ is full rank by characterizing its $D$ eigenvalues in terms of the singular values of $\Phi$ and $\lambda$.

Solution:

Suppose $\Phi = U^T\Lambda V$ is the SVD of $\Phi$, with $U$ and $V$ orthogonal and $\Lambda$ diagonal. Then $\Phi^T\Phi = V^T\Lambda^2V$.

It follows that $(\Phi^T\Phi + \lambda I)V^T = V^T(\Lambda^2 + \lambda I)$, so the columns of $V^T$ are eigenvectors of $\Phi^T\Phi + \lambda I$.

The $i^{th}$ eigenvalue of $\Phi^T\Phi + \lambda I$ is therefore $\lambda_i^2 + \lambda > 0$, where $\lambda_i$ is the $i^{th}$ singular value of $\Phi$.

Then $\det{(\Phi^T\Phi + \lambda I)} = \prod_i{(\lambda_i^2 + \lambda)} > 0$, which means $\Phi^T\Phi + \lambda I$ is invertible.
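
The eigenvalue claim is easy to check numerically. The sketch below (an illustration with an arbitrary random matrix, not part of the original solution) compares the eigenvalues of $\Phi^T\Phi + \lambda I$ with $\lambda_i^2 + \lambda$ computed from the singular values of $\Phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 3))   # arbitrary design matrix, for illustration
lam = 0.5

sing_vals = np.linalg.svd(Phi, compute_uv=False)             # singular values of Phi
eig_vals = np.linalg.eigvalsh(Phi.T @ Phi + lam * np.eye(3))

# Up to ordering, the eigenvalues equal sigma_i^2 + lam, all strictly positive.
print(np.sort(sing_vals**2 + lam))
print(np.sort(eig_vals))
```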

Regularized Least Squares: Different Norms

  • The $\ell^p$ norm of a vector $\vec{x}$ is defined as $$ \left \| \vec{x} \right \|_p = (\sum_{j=1}^{M} |x_j|^p)^\frac{1}{p} $$

  • For the regularized least squares above, we used $\ell^2$ norm. We could also use other $\ell^p$ norms for different regularizers and the objective function becomes $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p $$

Exercise: Derive the element-wise gradient for the above $\ell^p$-norm regularized objective function.
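
If you want to check your element-wise derivation numerically, the following sketch (not part of the original hands-on) compares a candidate gradient of the regularization term, $\frac{\lambda p}{2}|w_d|^{p-1}\operatorname{sign}(w_d)$ for $w_d \neq 0$, against central finite differences; the weights, $\lambda$, and $p$ are arbitrary choices.

```python
import numpy as np

lam, p = 0.3, 3
w = np.array([0.8, -1.2, 0.5])   # arbitrary nonzero weights, for illustration

def reg(w):
    """Regularization term (lambda / 2) * ||w||_p^p."""
    return 0.5 * lam * np.sum(np.abs(w) ** p)

# Candidate element-wise gradient (valid away from w_d = 0).
analytic = 0.5 * lam * p * np.abs(w) ** (p - 1) * np.sign(w)

# Central finite-difference approximation of the same gradient.
eps = 1e-6
numeric = np.array([(reg(w + eps * e) - reg(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])
print(analytic)
print(numeric)   # should agree to several decimal places
```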

Regularized Least Squares: Summary

  • Simple modification of linear regression
  • $\ell^2$ regularization controls the tradeoff between fitting error and model complexity.
    • Small $\ell^2$ regularization results in complex models, but with risk of overfitting
    • Large $\ell^2$ regularization results in simple models, but with risk of underfitting
  • It is important to find a regularization strength that balances the two, as the sketch below illustrates.
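
A hedged illustration of this tradeoff, not part of the original notes: the sketch below fits a degree-9 polynomial to a small synthetic dataset for several values of $\lambda$ and prints training versus validation error; the data-generating function and specific values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Synthetic 1-D regression data with degree-9 polynomial features.
    x = rng.uniform(-1, 1, size=n)
    t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
    return np.vander(x, N=10, increasing=True), t

def mse(Phi, t, w):
    return np.mean((Phi @ w - t) ** 2)

Phi_tr, t_tr = make_data(20)     # small training set
Phi_va, t_va = make_data(200)    # larger validation set

for lam in [1e-8, 1e-4, 1e-2, 1.0, 100.0]:
    D = Phi_tr.shape[1]
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(D), Phi_tr.T @ t_tr)
    print(f"lambda={lam:g}  train MSE={mse(Phi_tr, t_tr, w):.4f}  "
          f"val MSE={mse(Phi_va, t_va, w):.4f}")
```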

Probabilistic Interpretation of Least Squares Regression

  • We have derived the solution to least squares regression by minimizing an objective function. Now we provide a probabilistic perspective. Specifically, we will show that the solution to ordinary least squares is just the maximum likelihood estimate of $\vec{w}$, and the solution to regularized least squares is the maximum a posteriori (MAP) estimate.

Some Background

  • Gaussian Distribution $$ \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] $$

  • Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)

    • For a distribution $t \sim p(t|\theta)$, $\theta$ is some unknown parameter (like the mean or variance) to be estimated.
    • Given observation $\vec{t} = (t_1, t_2, \dots, t_N)$,
      • The Maximum Likelihood Estimator is $$ \theta_{ML} = \arg \max \prod_{n=1}^N p(t_n | \theta) $$
      • If we have some prior knowledge $p(\theta)$ about $\theta$, the MAP estimator maximizes the posterior probability of $\theta$: $$ \theta_{MAP} = \arg \max p(\theta | \vec{t}) = \arg \max \ p(\theta) \prod_{n=1}^N p(t_n | \theta) $$

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • We assume the signal-plus-noise model for a single data point $(\vec{x}, t)$ is $$ \begin{gathered} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gathered} $$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the perturbation/randomness.

  • Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$ t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1}) $$

Exercise:

  • Derive the likelihood function for a single data point, $p(t_n|\vec{x}_n,\vec{w},\beta)$.
  • Derive the complete log-likelihood function for the whole dataset, $\ln p(\vec{t}|\mathcal{X},\vec{w},\beta)$.
  • Use maximum likelihood to estimate the parameter $\vec{w}$.

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • The likelihood function of $t$ is just the probability density function (PDF) of $t$: $$ p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1}) $$

  • For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$ p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$

  • Notation Clarification

    • $p(t|\vec{x},\vec{w},\beta)$ is the PDF of $t$, whose distribution is parameterized by $\vec{x},\vec{w},\beta$.
    • $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$ is the Gaussian distribution with mean $\vec{w}^T \phi(\vec{x})$ and variance $\beta^{-1}$.
    • $\mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$ is the PDF of $t$, which has Gaussian distribution $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$.

Maximum Likelihood Estimator $\vec{w}_{ML}$: Derivation

  • Single data likelihood is $$ p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \} $$

  • Single data log-likelihood is $$ \ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln (2 \pi \beta^{-1}) - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 $$ We use the logarithm because the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$, and the logarithm converts products into sums, which makes the algebra easier.

  • Complete data log-likelihood is $$ \begin{aligned} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln (2 \pi \beta^{-1}) - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{aligned} $$

  • Maximum likelihood estimate $\vec{w}_{ML}$ is $$ \begin{aligned} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{aligned} $$

  • Familiar? Recall the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we could conclude that $$ \boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}} $$
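
A brief numerical check of this equivalence (not from the original notes): the sketch below generates data from the assumed signal-plus-noise model with made-up parameters and recovers $\vec{w}_{ML}$ via the pseudo-inverse, which is exactly the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 25.0                                   # noise precision, sigma = 0.2
x = rng.uniform(-1, 1, size=100)
Phi = np.vander(x, N=4, increasing=True)      # features 1, x, x^2, x^3
w_true = np.array([0.5, -1.0, 2.0, 0.3])      # made-up "true" weights
t = Phi @ w_true + rng.normal(scale=np.sqrt(1 / beta), size=100)

# Maximum likelihood = ordinary least squares via the pseudo-inverse.
w_ml = np.linalg.pinv(Phi) @ t
print(w_ml)   # close to w_true for this sample size and noise level
```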

MAP Estimator $\vec{w}_{MAP}$

  • The MAP estimator is obtained by $$ \begin{aligned} \vec{w}_{MAP} &= \arg \max p(\vec{w}|\vec{t}, \mathcal{X},\beta) & & (\text{Posterior Probability})\\ &= \arg \max \frac{p(\vec{w}, \vec{t}, \mathcal{X},\beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max \frac{p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta) & & (p(\vec{t}, \mathcal{X}, \beta) \text{ does not depend on } \vec{w})\\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) p(\mathcal{X}) p(\beta) & & (\text{Independence}) \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) & & (\text{Likelihood} \times \text{Prior}) \end{aligned} $$ These steps are just Bayes' theorem.
  • The only difference from the ML estimator is the extra factor $p(\vec{w})$, the PDF encoding our prior belief about $\vec{w}$. Here we assume $$ \vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I) $$

Exercise: Derive the MAP Estimator of $\vec{w}$ and compare the solution with regularized linear regression. What is $\lambda$ in this case?

MAP Estimator $\vec{w}_{MAP}$: Derivation

  • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian with PDF $$ p(\vec{w}) = \frac{1}{\left( 2 \pi \alpha^{-1} \right)^{D/2}} \exp \left \{ -\frac{\alpha}{2} \sum_{d=1}^D w_d^2 \right \} $$ where $D$ is the dimension of $\vec{w}$.

  • So the MAP estimator is $$ \begin{aligned} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 + \frac{\alpha}{2} \sum_{d=1}^D w_d^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(x_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{aligned} $$

  • Exactly the objective in regularized least squares! So $$ \boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} } $$

  • Compared with $\ell^2$-norm regularized least squares, we have $\lambda = \frac{\alpha}{\beta}$.
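
To make the correspondence concrete, here is a minimal sketch (not from the original notes) that computes $\vec{w}_{MAP}$ as regularized least squares with $\lambda = \alpha/\beta$; the precisions and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0        # made-up prior and noise precisions
lam = alpha / beta             # regularization coefficient lambda = alpha/beta

x = rng.uniform(-1, 1, size=30)
Phi = np.vander(x, N=6, increasing=True)
t = np.sin(np.pi * x) + rng.normal(scale=np.sqrt(1 / beta), size=30)

# MAP estimate = regularized least squares solution with lambda = alpha/beta.
D = Phi.shape[1]
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)
print(w_map)
```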

Problem 5a: MAP estimation for Linear Regression with unusual Prior

Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, where $$ \begin{gather} t_i = \vec{w}^T \vec{x_i} + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\epsilon$ is the "noise term".

(a) Quick quiz: what is the likelihood given $\vec{w}$? That is, what's $p(t_i | \vec{x}_i, \vec{w})$?

Answer: $p(t_i | \vec{x}_i, \vec{w}) = \mathcal{N}(t_i|\vec{w}^\top \vec{x_i}, \beta^{-1}) = \frac{1}{(2\pi \beta^{-1})^\frac{1}{2}} \exp{(-\frac{\beta}{2}(t_i - \vec{w}^\top \vec{x_i})^2)}$

Problem 5: MAP estimation for Linear Regression with unusual Prior

Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, sampled IID. We will also put a prior on $\vec{w}$, using a positive semi-definite (PSD) covariance matrix $\Sigma$. $$ \begin{gather} t_i = \vec{w}^T \vec{x_i} + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \\ \vec{w} \sim \mathcal{N}(0, \Sigma) \end{gather} $$ Note: the difference here is that our prior is a multivariate Gaussian with non-identity covariance! Also, we let $\mathcal{X} = \{\vec{x}_1, \cdots, \vec{x}_n\}$.

(a) Compute the log posterior function, $\log p(\vec{w}|\vec{t}, \mathcal{X},\beta)$

Hint: use Bayes' Rule

(b) Compute the MAP estimate of $\vec{w}$ for this model

Hint: the solution is very similar to the MAP estimate for a Gaussian prior with identity covariance
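
To experiment with this problem numerically, the sketch below (not part of the original hands-on) only simulates data from the stated generative model with a made-up PSD covariance $\Sigma$; you can plug the resulting $X$, $\vec{t}$, $\beta$, and $\Sigma$ into your derived MAP formula and compare against the sampled $\vec{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 200, 3, 10.0

# A made-up positive definite prior covariance Sigma.
A = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)

# Sample w ~ N(0, Sigma), then t_i = w^T x_i + eps with eps ~ N(0, 1/beta).
w = rng.multivariate_normal(np.zeros(d), Sigma)
X = rng.normal(size=(n, d))
t = X @ w + rng.normal(scale=np.sqrt(1 / beta), size=n)

# Plug X, t, beta, and Sigma into your MAP estimate and compare with w.
print(w)
```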