In [1]:
%matplotlib inline
from Lec05 import *
In this lecture, we will first look at how the polynomial degree and the size of the training dataset affect overfitting in linear regression. To deal with overfitting, regularized least squares will be introduced. Finally, we will show that ordinary linear regression and regularized linear regression can each be interpreted from a probabilistic perspective, using maximum likelihood estimation and maximum a posteriori estimation, respectively.
In [2]:
regression_overfitting_degree(degree0=0, degree1=3, degree2=9, degree3=12)
Remark
- In the above plots, we try to recover the true sinusoidal curve hidden in the data using polynomials of degree 0, 3, 9 and 12.
- In the first plot (degree=0), we only get a horizontal line. The learned curve can neither fit the training data nor match the true curve. This is called Underfitting.
- In the second plot (degree=3), the learned curve fits both the data and the true curve well. This is a good degree for our setting.
- In the third (degree=9) and fourth (degree=12) plots, as the degree increases, the training data are fitted better but the learned curve deviates further from the true curve. This is called Overfitting.
- Explanations of why the degree impacts the learned curve will come in a later Remark. A minimal sketch of how such a comparison could be produced is given below.
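The `Lec05` helper used above is not shown here. As a rough stand-in, here is a minimal sketch, assuming a $\sin(2\pi x)$ true curve with zero-mean Gaussian noise of standard deviation 0.3 (these data-generating settings are assumptions, not the actual `Lec05` configuration), of how such a degree comparison could be produced with `np.polyfit`.

```python
# Hypothetical sketch of a degree-comparison demo (not the actual Lec05 helper).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=N)   # noisy samples of the assumed true curve
x_grid = np.linspace(0, 1, 200)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, degree in zip(axes, [0, 3, 9, 12]):
    coeffs = np.polyfit(x, t, degree)                    # unregularized least-squares polynomial fit
    ax.scatter(x, t, label="training data")
    ax.plot(x_grid, np.sin(2 * np.pi * x_grid), label="true curve")
    ax.plot(x_grid, np.polyval(coeffs, x_grid), label=f"degree {degree}")
    ax.set_ylim(-2, 2)
    ax.legend(fontsize=8)
plt.show()
```

This sketch should reproduce the same qualitative behaviour: a flat line at degree 0, a good fit at degree 3, and wild oscillations at degrees 9 and 12.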
In [3]:
regression_overfitting_datasetsize(size0=13, size1=50, size2=100, size3=500)
Remark
- In the above plots, we try to predict the true sinusoidal curve hidden in the data with polynomial degree 12 and training dataset size 13, 50, 100 and 500.
- In the first plot (size=13), overfitting occurs as we have seen just now.
- Comparing the four plots altogether, we could see as the size increases, overfitting diminishes. And although more and more data points cannot be fitted by the learned curve (training error is becoming higher), but it matches the true curve better (test error will be smaller).
- Explanations of why training data size impact the predicted curve will come in the Remark later.
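Reusing the same assumed data-generating settings, a hypothetical sketch of the dataset-size comparison at a fixed degree of 12:

```python
# Hypothetical sketch: fixed degree 12, growing training set (assumed noise std 0.3).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 200)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, N in zip(axes, [13, 50, 100, 500]):
    x = np.linspace(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=N)
    coeffs = np.polyfit(x, t, 12)                        # same degree-12 fit for every dataset size
    ax.scatter(x, t, s=8, alpha=0.5, label="training data")
    ax.plot(x_grid, np.sin(2 * np.pi * x_grid), label="true curve")
    ax.plot(x_grid, np.polyval(coeffs, x_grid), label=f"N = {N}")
    ax.set_ylim(-2, 2)
    ax.legend(fontsize=8)
plt.show()
```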
In [4]:
regression_overfitting_curve()
Remark
- The plots above show the root mean squared error (RMSE) of the training dataset and the test dataset with respect to different degrees and dataset sizes (a sketch of how such curves could be computed follows this Remark).
- NOTE: For simplicity of presentation, we divided the dataset into training set and test set. However, it's not legitimate to find the optimal hyperparameter based on the test set. We will talk about legitimate ways of doing this when we cover model selection and cross-validation.
- Combining the last ten plots, we can make the following observations:
- Degree
- When the degree is very small, the regressor is not powerful enough to learn the underlying true model. At this point underfitting occurs: both the training error and the test error are high.
- As the degree increases, the regressor becomes powerful enough to roughly learn the true model, but not so powerful that it also fits the noise. Underfitting is reduced: both the training error and the test error become smaller.
- As the degree increases further, the regressor becomes so powerful that it can fit most of the training data perfectly. Since the data are noisy, the learned model then deviates from the true model. At this point overfitting occurs: the training error can become very small (sometimes even 0), while the test error increases.
- Dataset Size
- When the training dataset is small, a powerful regressor (i.e. one with a high degree) can fit every single training sample, so the noise is fitted as well. This corresponds to the right-hand end of the plot over degree (left).
- As the training dataset size increases, since the capacity of the regressor is finite (the degree is fixed), the regressor can no longer fit every single sample. Instead it learns a curve such that samples fall roughly equally on both sides. Since the noise is assumed to be zero-mean Gaussian, which is symmetric about 0, this curve is actually close to the underlying true curve.
- The training error and the test error both converge to 0.3, the standard deviation of the noise (you can check this by examining the Python code). This is not a coincidence and can be derived; we will cover it when we study the bias-variance tradeoff.
- $$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n-t_n)^2}$$
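As a rough illustration of how the error curve over degree could be computed (again with assumed data-generating settings rather than the actual `Lec05` code):

```python
# Hypothetical sketch of the RMSE-vs-degree curve (assumed noise std 0.3, 15 training points).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def make_data(N):
    x = rng.uniform(0, 1, N)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=N)

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

x_train, t_train = make_data(15)
x_test, t_test = make_data(200)

degrees = range(13)
train_err, test_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, t_train, d)
    train_err.append(rmse(np.polyval(coeffs, x_train), t_train))
    test_err.append(rmse(np.polyval(coeffs, x_test), t_test))

plt.plot(degrees, train_err, "o-", label="training RMSE")
plt.plot(degrees, test_err, "o-", label="test RMSE")
plt.xlabel("degree")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```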
In [5]:
regression_overfitting_coeffs()
Remark
- The table above shows the coefficients (multiplied by 100 for better visualization) for different degrees.
- We can see that when overfitting occurs, some of the coefficients become extremely large!
- So one intuition for handling overfitting is to penalize large coefficients.
Recall that the objective function we minimized in the last lecture is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$
To penalize large coefficients, we add a penalization/regularization term and minimize the two together: $$ E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})} $$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.
$\lambda$ is the regularization coefficient.
Based on what we derived in the last lecture, we can write the objective function as $$ \begin{align} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ &= \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} + \frac{\lambda}{2}\vec{w}^T\vec{w} \end{align} $$
The gradient is $$ \begin{align} \nabla_\vec{w} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{align} $$
Setting the gradient to 0, we will get the solution $$ \boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} } $$
In the solution to ordinary least squares, $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, we cannot guarantee that $\Phi^T \Phi$ is invertible. In regularized least squares, however, if $\lambda > 0$ then $\Phi^T \Phi + \lambda I$ is always invertible.
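Here is a minimal numerical sketch of this closed-form solution; the polynomial design matrix, noise level, and value of $\lambda$ are illustrative assumptions.

```python
# Regularized least squares via the closed-form solution w = (Phi^T Phi + lambda I)^{-1} Phi^T t.
import numpy as np

def poly_design_matrix(x, degree):
    # Phi[n, j] = x_n ** j for j = 0, ..., degree
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, t, lam):
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)   # assumed noisy sinusoid

Phi = poly_design_matrix(x, 12)
w_ols = np.linalg.pinv(Phi) @ t            # ordinary least squares (pseudo-inverse)
w_reg = ridge_fit(Phi, t, lam=1e-3)        # regularized least squares
print("largest |w|, OLS:  ", np.abs(w_ols).max())
print("largest |w|, ridge:", np.abs(w_reg).max())
```

With a small positive $\lambda$, the regularized coefficients typically stay far smaller than the unregularized ones on this ill-conditioned design matrix, which is exactly the effect we want.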
The $\ell^p$ norm of a vector $\vec{x}$ is defined as $$ \left \| \vec{x} \right \|_p = (\sum_{j=1}^{M} |x_j|^p)^\frac{1}{p} $$
For the regularized least squares above, we used $\ell^2$ norm. We could also use other $\ell^p$ norms for different regularizers and the objective function becomes $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p $$
Remark
- RSS is the residual sum of squares, which is the sum of squared errors $E_D(\vec{w})$ that we use.
- This plot illustrates intuitively why the lasso has sparser solutions than ridge regression.
- Our objective function is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p $$ To illustrate, let's look at an equivalent constrained problem. (Not exactly equivalent, just for illustration purposes.) $$ \begin{aligned} & {\text{minimize}} & & \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2\\ & \text{subject to} & & \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p \leq C \end{aligned} $$
- The objective function $\frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$ corresponds to the red contours in the plots. The optimal solution without the constraint is $\hat{\vec{w}}_{OLS}$, which is just the solution to ordinary least squares. The farther we move away from $\hat{\vec{w}}_{OLS}$, the larger the objective function becomes.
- The feasible region satisfying the constraint is the cyan area in the plots. The solution must fall in this region. For the lasso it is a diamond; for ridge regression it is a circle.
- As we increase the value of the objective function, the contour expands. The first time the contour touches the feasible cyan region, the touching point is the optimal solution. This is because the solution should be as close to $\hat{\vec{w}}_{OLS}$ as possible while still falling in the feasible region.
- Since the lasso's diamond region has four corners, the contours tend to touch a corner first. Because the corners lie on the axes, where at least one coordinate is zero, this encourages sparsity of the solution. By contrast, the circular region in ridge regression does not give us this property. (A small numerical comparison of the two regularizers is sketched below.)
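As a small numerical illustration of this sparsity effect, here is a hedged sketch using scikit-learn's `Lasso` and `Ridge`; the polynomial features and the value of `alpha` are arbitrary choices for illustration, so the exact counts printed depend on these choices.

```python
# Compare the number of exactly-zero coefficients under L1 (lasso) and L2 (ridge) penalties.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
Phi = PolynomialFeatures(degree=12, include_bias=False).fit_transform(x[:, None])

lasso = Lasso(alpha=1e-3, max_iter=200000).fit(Phi, t)
ridge = Ridge(alpha=1e-3).fit(Phi, t)
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)), "out of", lasso.coef_.size)
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)), "out of", ridge.coef_.size)
```

The lasso tends to set many coefficients exactly to zero, while ridge regression merely shrinks them.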
In [6]:
regression_regularization_plot()
Remark
- In the second plot, we can see that after the regularization term is added, the learned curve matches the true curve much better than the learned curve without regularization in the first plot.
- However, in the third plot, when the regularization coefficient $\lambda$ is too large, the minimization mostly focuses on minimizing $\left \| \vec{w} \right \|$, so the learned curve deviates greatly from the true curve. (We will see on the next slide that the coefficients $\vec{w}$ are very small in this case.)
- In the fourth plot, as we increase $\lambda$, the training error increases monotonically because we move further away from the task of minimizing the sum of squared errors $E_D(\vec{w})$. The test error, on the other hand, reaches its minimum near $\lambda=1$, which is the balancing point in the tradeoff between minimizing $E_D(\vec{w})$ and minimizing the regularization term $E_W(\vec{w})$. (A sketch sweeping $\lambda$ in this way is given below.)
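A hedged sketch of such a $\lambda$ sweep, using the closed-form regularized solution and the same assumed data-generating settings as before:

```python
# Sweep the regularization coefficient and report training / test RMSE.
import numpy as np

rng = np.random.default_rng(0)

def make_data(N):
    x = rng.uniform(0, 1, N)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=N)

def design(x, degree=12):
    return np.vander(x, degree + 1, increasing=True)

def rmse(Phi, w, t):
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

x_tr, t_tr = make_data(15)
x_te, t_te = make_data(200)
Phi_tr, Phi_te = design(x_tr), design(x_te)

for lam in [1e-8, 1e-4, 1e-2, 1.0, np.exp(10)]:
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(Phi_tr.shape[1]), Phi_tr.T @ t_tr)
    print(f"lambda = {lam:10.3g}   train RMSE = {rmse(Phi_tr, w, t_tr):.3f}   "
          f"test RMSE = {rmse(Phi_te, w, t_te):.3f}")
```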
In [7]:
regression_regularization_coeff()
Remark
- From the table above, we can see the regularization term has effectively constrained those huge coefficients.
- However, when $\lambda$ is large ($\lambda = e^{10}$), the coefficients are "over-regularized".
Gaussian Distribution $$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] $$
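As a quick sanity check, the density above can be evaluated directly and compared against `scipy.stats.norm` (the test values below are arbitrary):

```python
# Verify the Gaussian density formula against scipy.stats.norm.
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-3, 3, 7)
print(np.allclose(gaussian_pdf(x, 0.5, 2.0), norm.pdf(x, loc=0.5, scale=np.sqrt(2.0))))  # True
```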
Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)
We assume the signal+noise model of single data $(\vec{x}, t)$ is $$ \begin{gather} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ of which $\vec{w}^T \phi(\vec{x})$ is the true model, $\epsilon$ is the perturbation/randomness.
Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$ t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1}) $$
The likelihood function of $t$ is just probability density function (PDF) of $t$ $$ p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1}) $$
For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$ p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$
Notation Clarification
Main Idea of Maximum Likelihood Estimate
Intuition about Maximum Likelihood Estimation
Single data likelihood is $$ p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \} $$
Single data log-likelihood is $$ \ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 $$ We use logarithm because maximizer of $f(x)$ is the same as maximizer of $\log f(x)$. Logarithm can convert product to summation which makes life easier.
Complete data log-likelihood is $$ \begin{align} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align} $$
Maximum likelihood estimate $\vec{w}_{ML}$ is $$ \begin{align} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align} $$
Familiar? Recall the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we could conclude that $$ \boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}} $$
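A small numerical check of this equivalence; the degree, noise level, and the use of `scipy.optimize.minimize` are illustrative choices, not part of the derivation. Maximizing the log-likelihood numerically should recover the pseudo-inverse least-squares solution.

```python
# Check that the maximum likelihood estimate coincides with the least-squares solution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
Phi = np.vander(x, 4, increasing=True)        # degree-3 polynomial features
beta = 1.0 / 0.3 ** 2                          # assumed known noise precision

def neg_log_likelihood(w):
    residual = Phi @ w - t
    return 0.5 * x.size * np.log(2 * np.pi / beta) + 0.5 * beta * np.sum(residual ** 2)

def gradient(w):
    return beta * Phi.T @ (Phi @ w - t)

w_ls = np.linalg.pinv(Phi) @ t                 # closed-form least-squares solution
res = minimize(neg_log_likelihood, np.zeros(Phi.shape[1]), jac=gradient,
               method="BFGS", options={"gtol": 1e-10})
print(np.allclose(w_ls, res.x, atol=1e-4))     # True (up to optimizer tolerance)
```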
$\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian (of dimension $M$, the length of $\vec{w}$) with PDF $$ p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^M} \exp \left \{ -\frac{1}{2 \alpha^{-1}} \sum_{j=1}^M w_j^2 \right \} $$
So the MAP estimator is $$ \begin{align} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 + \frac{\alpha}{2} \sum_{j=1}^M w_j^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(x_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{align} $$
Exactly the objective in regularized least squares! So $$ \boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} } $$
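Similarly, a hedged numerical check that the MAP estimate matches the regularized least-squares solution with $\lambda = \alpha / \beta$; the values of $\alpha$, $\beta$, and the polynomial degree are arbitrary choices for illustration.

```python
# Check that the MAP estimate equals the regularized least-squares solution with lambda = alpha / beta.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
Phi = np.vander(x, 6, increasing=True)         # degree-5 polynomial features

alpha, beta = 0.5, 1.0 / 0.3 ** 2               # assumed prior precision and noise precision

def neg_log_posterior(w):
    residual = Phi @ w - t
    return 0.5 * beta * np.sum(residual ** 2) + 0.5 * alpha * np.sum(w ** 2)

def gradient(w):
    return beta * Phi.T @ (Phi @ w - t) + alpha * w

lam = alpha / beta
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
res = minimize(neg_log_posterior, np.zeros(Phi.shape[1]), jac=gradient,
               method="BFGS", options={"gtol": 1e-10})
print(np.allclose(w_reg, res.x, atol=1e-4))     # True (up to optimizer tolerance)
```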
Remark
- In the above expression, $\frac{\alpha}{\beta}$ corresponds to the regularization coefficient $\lambda$ we used earlier in regularized least squares.
- Priors: Represent prior beliefs about acceptable values for model parameters.
- Example: In linear regression, $\ell^2$ regularization can be interpreted as placing a Gaussian Prior on the regression coefficients.
- All statistical models and machine learning algorithms make assumptions.
- All reasoning is based on implicit assumptions.
- A Bayesian will tell you that his prior is a way of explicitly stating those assumptions.
- This can all get very philosophical, but...
- Bayesian reasoning is best seen as a useful tool.
- Many concepts in machine learning have Bayesian interpretations.
- Choice of loss / error function, regularization, etc.
- For a fully Bayesian take on machine learning, check out the Murphy textbook:
- We will cover more about Bayesian reasoning when we move on to Bayesian Linear Regression, a form of linear regression that can also be applied to streaming data.