Optimization

Maximum Likelihood Estimate

  • $$\theta^{*} = \arg \max_{\theta} \prod_i p(y_i = c \vert \mathbf{x}_i) = \arg \min_{\theta} \sum_i -\log p(y_i = c \vert \mathbf{x}_i)$$
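A minimal numpy sketch of this equivalence, assuming some hypothetical predicted probabilities for the correct class of each sample:

```python
import numpy as np

# Hypothetical predicted probabilities p(y_i = c | x_i) of the correct class for three samples.
probs = np.array([0.9, 0.7, 0.8])

# Maximizing the product of likelihoods ...
likelihood = np.prod(probs)

# ... is equivalent to minimizing the sum of negative log-likelihoods,
# which is numerically better behaved (no underflow from long products).
nll = -np.sum(np.log(probs))

print(likelihood, np.exp(-nll))  # the two agree: exp(-NLL) equals the product of likelihoods
```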

Parameter Gradients

$$\frac{\partial l}{\partial \theta} = \left[\frac{\partial l}{\partial \mathbf{W}^1},\frac{\partial l}{\partial b^1},...,\frac{\partial l}{\partial \mathbf{W}^L},\frac{\partial l}{\partial b^L}\right]$$

Empirical Risk Minimization

  • Optimization Problem: $\theta^{*} = \arg \min_{\theta} \frac{1}{N}\sum_i l\left(f(\mathbf{x}_i; \theta), y_i\right)$

  • Taylor Series Expansion

    $$f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla_\mathbf{x} f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^\top \nabla_\mathbf{x}^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0)$$

Gradient Descent

  • Descent direction: $-\frac{\partial f}{\partial x}$
Procedure:
  • Initialize $\theta$
  • Move in the direction of the negative gradient (see the sketch below):
    $\theta = \theta - \alpha \frac{\partial f}{\partial \theta}$
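A minimal gradient-descent sketch of this procedure on a toy quadratic objective (the objective and learning rate are assumptions for illustration):

```python
import numpy as np

def grad_f(theta):
    # Gradient of the toy objective f(theta) = theta_0^2 + 10 * theta_1^2 (an illustrative assumption).
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

alpha = 0.05                    # learning rate
theta = np.array([3.0, 2.0])    # initialize

for _ in range(100):
    theta = theta - alpha * grad_f(theta)   # move in the descent direction -df/dtheta

print(theta)  # approaches the minimizer [0, 0]
```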

Newton's method: Second Order Gradient Descent

$$\mathbf{x} = \mathbf{x}_0 - \left[\nabla_{\mathbf{x}}^2 f(\mathbf{x}_0)\right]^{-1} \nabla_\mathbf{x} f(\mathbf{x}_0)$$

Intuition: $\Rightarrow$ adaptive steps

The Hessian matrix captures whether the contours are circles or ellipses. Multiplying by the inverse Hessian squishes the ellipses into circles.

(put picture of circular contours vs. elliptical contours)

  • Adapting how much to move in each direction.
  • Along the directions with large variance, we should take larger steps (see the sketch below).
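A small sketch of this intuition on an elongated quadratic, contrasting one Newton step with one plain gradient step (the toy objective is an assumption):

```python
import numpy as np

# Toy quadratic with elliptical contours: f(x) = x_0^2 + 10 * x_1^2
def grad_f(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])

def hess_f(x):
    return np.array([[2.0, 0.0],
                     [0.0, 20.0]])

x0 = np.array([3.0, 2.0])

# Newton step: x = x0 - H^{-1} grad. For a quadratic it jumps straight to the minimum,
# because the inverse Hessian rescales the ellipse into a circle.
x_newton = x0 - np.linalg.solve(hess_f(x0), grad_f(x0))

# Plain gradient step: the learning rate must be small enough for the steep direction,
# so progress along the shallow direction x_0 is slow.
x_gd = x0 - 0.05 * grad_f(x0)

print(x_newton, x_gd)  # [0. 0.] vs [2.7 0.]
```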

Local and Global Optima

  • Critical points: $\{\mathbf{x}\in \mathbb{R}^d \mid \nabla_\mathbf{x} f(\mathbf{x}) = 0\}$
  • Curvature in direction $v$ is: $v^T\, \nabla_\mathbf{x}^2 f(\mathbf{x})\, v$

  • Types of critical points (see the sketch after this list):

    • local minimum: the curvature is positive along every direction
    • local maximum: the curvature is negative along every direction
    • saddle point:
      Moving along some directions, the function is increasing, and along other directions, it is decreasing. So we don't know in which direction to move.
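A minimal sketch of this classification using the eigenvalues of the Hessian, which are the curvatures $v^T \nabla_\mathbf{x}^2 f(\mathbf{x})\, v$ along the principal directions (the toy function is an assumption):

```python
import numpy as np

# Hessian of the toy saddle f(x) = x_0^2 - x_1^2 at its critical point (0, 0).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # curvatures along the principal directions

if np.all(eigvals > 0):
    kind = "local minimum"        # curving up in every direction
elif np.all(eigvals < 0):
    kind = "local maximum"        # curving down in every direction
else:
    kind = "saddle point"         # up in some directions, down in others

print(eigvals, kind)              # [-2.  2.] saddle point
```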

Points to consider

  • Deep networks do not have a single global optimum

  • Highly non-convex optimization problem

    • can permute hidden units and the connections to achieve the same function
    • hidden unit parameters are not identifiable
  • Optimization is often stuck at local minimum or plateau (and saddle points)

Variants of Gradient Descent

  • Batch Gradient Descent
  • Stochastic Gradient Descent

    • just using one sample at a time
  • Mini-Batch Gradient Descent

    • somewhere in between
    • using some number of samples

Intuition:

  • In batch gradient descent, we update the parameters after visiting all the samples.

    • This results in accurate derivatives, but it is slow.
  • In stochastic gradient descent, we update the parameters by computing the derivative using only one sample. The idea is that, on average, the computed derivatives should be (roughly) the same.

    • This results in noisy (inaccurate) derivatives, but fast updates.
  • Mini-batch gradient descent is a trade-off between accuracy of the derivatives and speed of computation.

Almost always, we should use mini-batch:

  • The function is non-convex, so there is no "right direction".
  • With mini-batch, we can jump out of local minima.
    In every iteration, we use a different subset of samples to compute derivatives.

Algorithm
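
A sketch of a mini-batch SGD training loop; the gradient function, data, and hyperparameters below are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, alpha=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD; grad_fn(theta, X_batch, y_batch) returns the gradient on a batch."""
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)              # a different subset in every iteration
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - alpha * grad_fn(theta, X[idx], y[idx])
    return theta

# Illustrative use on least-squares linear regression with synthetic data.
X = np.random.randn(200, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(200)

def lsq_grad(w, Xb, yb):
    # Gradient of the mean squared error (1/m) * ||Xb w - yb||^2.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = minibatch_sgd(lsq_grad, np.zeros(3), X, y, alpha=0.05, epochs=50)
print(w)  # close to true_w
```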

Nesterov Accelerated Gradient Descent

Idea: remembering the previous direction

$$v_t = \gamma v_{t-1} + \alpha \nabla_\theta J(\theta)$$

$$\theta_t = \theta_{t-1} - v_t$$

In Nesterov's method, the order in which we move and take the derivative is different:

  • First move to the look-ahead point $\theta - \gamma v_{t-1}$,
  • then take the gradient there.

Nesterov showed that this method converges as quickly as a second-order approximation.
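A minimal sketch of this update, with the gradient taken at the look-ahead point as described above (the toy objective and hyperparameters are assumptions):

```python
import numpy as np

def grad_J(theta):
    # Toy quadratic objective J(theta) = theta_0^2 + 10 * theta_1^2 (illustrative).
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

alpha, gamma = 0.01, 0.9
theta = np.array([3.0, 2.0])
v = np.zeros_like(theta)

for _ in range(200):
    lookahead = theta - gamma * v                 # first move along the remembered direction ...
    v = gamma * v + alpha * grad_J(lookahead)     # ... then take the gradient there
    theta = theta - v

print(theta)  # approaches [0, 0]
```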

Comparing the number of iterations needed to reach accuracy $\epsilon = \|f(x) - f(x^*)\|$, where $x^*$ is the true optimum:

  • gradient descent: $\#\text{ of iter} \sim \frac{1}{\epsilon}$
  • Nesterov accelerated gradient descent: $\#\text{ of iter} \sim \frac{1}{\sqrt \epsilon}$

Adaptive Learning Rate

We need a different learning rate for each parameter.

Idea: slow down parameters that are changing fast, speed up parameters that are changing slowly.

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{ii}^t + \epsilon}} \nabla_\theta J(\theta_{t,i})$$

where $G^t$ is a diagonal matrix whose $i$-th diagonal entry accumulates the squared gradients of parameter $i$: $G_{ii}^t = G_{ii}^{t-1} + \left(\nabla_\theta J(\theta_{t,i})\right)^2$

$$\theta_{t+1} = \theta_{t} - \frac{\alpha}{\sqrt{G_{t} + \epsilon}} \nabla_\theta J(\theta_{t})$$
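A minimal AdaGrad-style sketch of this per-parameter update (the toy gradient with very different scales per coordinate is an assumption):

```python
import numpy as np

def grad_J(theta):
    # Toy gradient with very different scales per parameter (illustrative).
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

alpha, eps = 0.5, 1e-8
theta = np.array([3.0, 2.0])
G = np.zeros_like(theta)          # running sum of squared gradients (the diagonal of G)

for _ in range(500):
    g = grad_J(theta)
    G = G + g ** 2                                   # accumulate squared gradients per parameter
    theta = theta - alpha / np.sqrt(G + eps) * g     # fast-changing parameters get smaller steps

print(theta)  # approaches [0, 0], but ever more slowly as G keeps growing
```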

Now we have a new problem: $G$ keeps accumulating, so eventually the effective learning rate becomes very small.

  • Solution: as we go further away in time, we forget the very old gradients in $G$. This ensures that $G$ does not become very big.

RMSProp:
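
For reference, the standard RMSProp update replaces the full sum with an exponentially decaying average of the squared gradients (decay rate $\gamma$, typically around 0.9):

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$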

Adadelta:

The difference between RMSProp and Adadelta is that the update in RMSProp does not have the same units as the parameters; Adadelta fixes this by also keeping a decaying average of the squared parameter updates.

  • Gradients at time $t$: $g^t$
  • $G = gg^T$ but we only need the diagonal elements of $G$, which are $G_{ii}$
  • $G_{ii} = \sum_{t=1}^T {(g_{i}^t)}^2$

Adaptive Moment Estimation (ADAM)

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\, g_t \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t} \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\, \hat{\mathbf{m}}_t$$
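A minimal sketch of a single Adam step with the usual default hyperparameters; how the state ($\mathbf{m}$, $\mathbf{v}$, $t$) is threaded through a training loop is an assumption for illustration:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running estimates of the first and second moments of g."""
    m = beta1 * m + (1 - beta1) * g            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```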

SGD with Noise:

Add small Gaussian noise to the parameters in each iteration.

$$\theta = \theta - \alpha \nabla_\theta J(\theta) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_t^2)$$
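A minimal sketch of this idea on a toy objective (the objective, learning rate, and noise scale are assumptions):

```python
import numpy as np

def grad_J(theta):
    # Toy gradient (illustrative).
    return 2.0 * theta

alpha, sigma = 0.1, 0.01
theta = np.array([3.0, 2.0])

for t in range(100):
    noise = np.random.normal(0.0, sigma, size=theta.shape)   # small Gaussian perturbation
    theta = theta - alpha * grad_J(theta) + noise            # noisy update can help escape shallow minima and saddles

print(theta)  # near [0, 0], up to noise of scale ~ sigma
```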

Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

Gradient Free Optimization
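
A deliberately simplified evolution-strategy sketch in the spirit of CMA-ES, but without the covariance adaptation: it only uses function values, never gradients. The objective and hyperparameters are illustrative assumptions.

```python
import numpy as np

def f(x):
    # Objective to minimize; only function values are used below, no gradients (illustrative).
    return x[0] ** 2 + 10.0 * x[1] ** 2

mean = np.array([3.0, 2.0])    # mean of the search distribution
sigma = 1.0                    # global step size
n_samples, n_elite = 20, 5

for _ in range(50):
    # Sample candidates around the current mean (full CMA-ES would also adapt a covariance matrix).
    candidates = mean + sigma * np.random.randn(n_samples, 2)
    scores = np.array([f(c) for c in candidates])
    elite = candidates[np.argsort(scores)[:n_elite]]   # keep the best candidates
    mean = elite.mean(axis=0)                          # move the mean toward them
    sigma *= 0.9                                       # shrink the search radius

print(mean)  # close to the minimizer [0, 0]
```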

