Optimization

Maximum Likelihood Estimate

  • $$\theta^{*} = \arg \max_{\theta} \prod_i p(y_i = c \vert \mathbf{x}_i) = \arg \min_{\theta} \sum_i -\log p(y_i = c \vert \mathbf{x}_i)$$
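A minimal numpy sketch of this equivalence, assuming some hypothetical predicted probabilities for the correct class of each sample:

```python
import numpy as np

# Hypothetical predicted probabilities p(y_i = c | x_i) of the correct class for three samples.
probs = np.array([0.9, 0.7, 0.8])

# Maximizing the product of likelihoods ...
likelihood = np.prod(probs)

# ... is equivalent to minimizing the sum of negative log-likelihoods,
# which is numerically better behaved (no underflow from long products).
nll = -np.sum(np.log(probs))

print(likelihood, np.exp(-nll))  # the two agree: exp(-NLL) equals the product of likelihoods
```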

Parameter Gradients

$$\frac{\partial l}{\partial \theta} = \left[\frac{\partial l}{\partial \mathbf{W}^1},\frac{\partial l}{\partial b^1},...,\frac{\partial l}{\partial \mathbf{W}^L},\frac{\partial l}{\partial b^L}\right]$$

Empirical Risk Minimization

  • Optimization Problem: $\theta^{*} = \arg \min_{\theta} \frac{1}{N}\sum_i l\left(f(\mathbf{x}_i; \theta), y_i\right)$

  • Taylor Series Expansion

    $$f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla_\mathbf{x} f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^\top \nabla_\mathbf{x}^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0)$$

Gradient Descent

  • Descent direction: $-\frac{\partial f}{\partial x}$
Procedure:
  • Initialize $\theta$
  • Move in the direction of the negative gradient (see the sketch below):
    $\theta = \theta - \alpha \frac{\partial f}{\partial \theta}$
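A minimal gradient-descent sketch of this procedure on a toy quadratic objective (the objective and learning rate are assumptions for illustration):

```python
import numpy as np

def grad_f(theta):
    # Gradient of the toy objective f(theta) = theta_0^2 + 10 * theta_1^2 (an illustrative assumption).
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

alpha = 0.05                    # learning rate
theta = np.array([3.0, 2.0])    # initialize

for _ in range(100):
    theta = theta - alpha * grad_f(theta)   # move in the descent direction -df/dtheta

print(theta)  # approaches the minimizer [0, 0]
```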

Newton's method: Second Order Gradient Descent

$$\mathbf{x} = \mathbf{x}_0 - \left[\nabla_{\mathbf{x}}^2 f(\mathbf{x}_0)\right]^{-1} \nabla_\mathbf{x} f(\mathbf{x}_0)$$

Intuition: $\Rightarrow$ adaptive steps

The Hessian matrix captures whether the contours are circles or ellipses. Multiplying by the inverse Hessian squishes the ellipses into circles.

(put picture of circular contours vs. elliptical contours)

  • Adapting how much to move in each direction.
  • Along the directions with large variance, we should take larger steps (see the sketch below).
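A small sketch of this intuition on an elongated quadratic, contrasting one Newton step with one plain gradient step (the toy objective is an assumption):

```python
import numpy as np

# Toy quadratic with elliptical contours: f(x) = x_0^2 + 10 * x_1^2
def grad_f(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])

def hess_f(x):
    return np.array([[2.0, 0.0],
                     [0.0, 20.0]])

x0 = np.array([3.0, 2.0])

# Newton step: x = x0 - H^{-1} grad. For a quadratic it jumps straight to the minimum,
# because the inverse Hessian rescales the ellipse into a circle.
x_newton = x0 - np.linalg.solve(hess_f(x0), grad_f(x0))

# Plain gradient step: the learning rate must be small enough for the steep direction,
# so progress along the shallow direction x_0 is slow.
x_gd = x0 - 0.05 * grad_f(x0)

print(x_newton, x_gd)  # [0. 0.] vs [2.7 0.]
```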

Local and Global Optima

  • Critical points: $\{\mathbf{x}\in \mathbb{R}^d \mid \nabla_\mathbf{x} f(\mathbf{x}) = 0\}$
  • Curvature in direction $v$ is: $v^T\, \nabla_\mathbf{x}^2 f(\mathbf{x})\, v$

  • Types of critical points (see the sketch after this list):

    • local minimum: the curvature is positive along every direction
    • local maximum: the curvature is negative along every direction
    • saddle point:
      Moving along some directions, the function is increasing, and along other directions, it is decreasing. So we don't know in which direction to move.
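A minimal sketch of this classification using the eigenvalues of the Hessian, which are the curvatures $v^T \nabla_\mathbf{x}^2 f(\mathbf{x})\, v$ along the principal directions (the toy function is an assumption):

```python
import numpy as np

# Hessian of the toy saddle f(x) = x_0^2 - x_1^2 at its critical point (0, 0).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # curvatures along the principal directions

if np.all(eigvals > 0):
    kind = "local minimum"        # curving up in every direction
elif np.all(eigvals < 0):
    kind = "local maximum"        # curving down in every direction
else:
    kind = "saddle point"         # up in some directions, down in others

print(eigvals, kind)              # [-2.  2.] saddle point
```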

Points to consider

  • Deep networks do not have a single global optimum

  • Highly non-convex optimization problem

    • can permute hidden units and the connections to achieve the same function
    • hidden unit parameters are not identifiable
  • Optimization is often stuck at local minimum or plateau (and saddle points)

Variants of Gradient Descent

  • Batch Gradient Descent
  • Stochastic Gradient Descent

    • just using one sample at a time
  • Mini-Batch Gradient Descent

    • somewhere in between
    • using some number of samples

Intuition:

  • In batch gradient descent, we update the parameters after visiting all the samples.

    • This results in accurate derivatives, but it is slow.
  • In stochastic gradient descent, we update the parameters by computing the derivative using only one sample. The idea is that, on average, the computed derivatives should be (roughly) the same.

    • This results in noisy (inaccurate) derivatives, but fast updates.
  • Mini-batch gradient descent is a trade-off between accuracy of the derivatives and speed of computation.

Almost always, we should use mini-batch:

  • The function is non-convex, so there is no "right direction".
  • With mini-batch, we can jump out of local minima.
    In every iteration, we use a different subset of samples to compute derivatives.

Algorithm
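
A sketch of a mini-batch SGD training loop; the gradient function, data, and hyperparameters below are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, alpha=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD; grad_fn(theta, X_batch, y_batch) returns the gradient on a batch."""
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)              # a different subset in every iteration
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - alpha * grad_fn(theta, X[idx], y[idx])
    return theta

# Illustrative use on least-squares linear regression with synthetic data.
X = np.random.randn(200, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(200)

def lsq_grad(w, Xb, yb):
    # Gradient of the mean squared error (1/m) * ||Xb w - yb||^2.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = minibatch_sgd(lsq_grad, np.zeros(3), X, y, alpha=0.05, epochs=50)
print(w)  # close to true_w
```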

Nesterov Accelerated Gradient Descent

Idea: remembering the previous direction

$$v_t = \gamma v_{t-1} + \alpha \nabla_\theta J(\theta)$$

$$\theta_t = \theta_{t-1} - v_t$$

In Nesterov's method, the order in which we move and take the derivative is different:

  • First move to the look-ahead point $\theta - \gamma v_{t-1}$,
  • then take the gradient there.

Nesterov showed that this method converges as quickly as a second-order approximation.
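A minimal sketch of this update, with the gradient taken at the look-ahead point as described above (the toy objective and hyperparameters are assumptions):

```python
import numpy as np

def grad_J(theta):
    # Toy quadratic objective J(theta) = theta_0^2 + 10 * theta_1^2 (illustrative).
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

alpha, gamma = 0.01, 0.9
theta = np.array([3.0, 2.0])
v = np.zeros_like(theta)

for _ in range(200):
    lookahead = theta - gamma * v                 # first move along the remembered direction ...
    v = gamma * v + alpha * grad_J(lookahead)     # ... then take the gradient there
    theta = theta - v

print(theta)  # approaches [0, 0]
```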

Comparing the number of iterations needed to reach accuracy $\epsilon = \|f(x) - f(x^*)\|$, where $x^*$ is the true optimum:

  • gradient descent: $\#\text{ of iter} \sim \frac{1}{\epsilon}$
  • Nesterov accelerated gradient descent: $\#\text{ of iter} \sim \frac{1}{\sqrt \epsilon}$

Adaptive Learning Rate

We need a different learning rate for each parameter.

Idea: slow down parameters that are changing fast, speed up parameters that are changing slowly.

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{ii}^t + \epsilon}} \nabla_\theta J(\theta_{t,i})$$

where $G^t$ is a diagonal matrix whose $i$-th diagonal entry accumulates the squared gradients of parameter $i$: $G_{ii}^t = G_{ii}^{t-1} + \left(\nabla_\theta J(\theta_{t,i})\right)^2$

$$\theta_{t+1} = \theta_{t} - \frac{\alpha}{\sqrt{G_{t} + \epsilon}} \nabla_\theta J(\theta_{t})$$
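A minimal AdaGrad-style sketch of this per-parameter update (the toy gradient with very different scales per coordinate is an assumption):

```python
import numpy as np

def grad_J(theta):
    # Toy gradient with very different scales per parameter (illustrative).
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

alpha, eps = 0.5, 1e-8
theta = np.array([3.0, 2.0])
G = np.zeros_like(theta)          # running sum of squared gradients (the diagonal of G)

for _ in range(500):
    g = grad_J(theta)
    G = G + g ** 2                                   # accumulate squared gradients per parameter
    theta = theta - alpha / np.sqrt(G + eps) * g     # fast-changing parameters get smaller steps

print(theta)  # approaches [0, 0], but ever more slowly as G keeps growing
```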

Now we have a new problem: $G$ keeps accumulating, so eventually the effective learning rate becomes very small.

  • Solution: as we go further away in time, we forget the very old gradients in $G$. This ensures that $G$ does not become very big.

RMSProp:
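
For reference, the standard RMSProp update replaces the full sum with an exponentially decaying average of the squared gradients (decay rate $\gamma$, typically around 0.9):

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$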

Adadelta:

The difference between RMSProp and Adadelta is that the update in RMSProp does not have the same units as the parameters; Adadelta fixes this by also keeping a decaying average of the squared parameter updates.

  • Gradients at time $t$: $g^t$
  • $G = gg^T$ but we only need the diagonal elements of $G$, which are $G_{ii}$
  • $G_{ii} = \sum_{t=1}^T {(g_{i}^t)}^2$

Adaptive Moment Estimation (ADAM)

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\, g_t \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t} \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\, \hat{\mathbf{m}}_t$$
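A minimal sketch of a single Adam step with the usual default hyperparameters; how the state ($\mathbf{m}$, $\mathbf{v}$, $t$) is threaded through a training loop is an assumption for illustration:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running estimates of the first and second moments of g."""
    m = beta1 * m + (1 - beta1) * g            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```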

SGD with Noise:

Add small Gaussian noise to the parameters in each iteration.

$$\theta = \theta - \alpha \nabla_\theta J(\theta) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_t^2)$$
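A minimal sketch of this idea on a toy objective (the objective, learning rate, and noise scale are assumptions):

```python
import numpy as np

def grad_J(theta):
    # Toy gradient (illustrative).
    return 2.0 * theta

alpha, sigma = 0.1, 0.01
theta = np.array([3.0, 2.0])

for t in range(100):
    noise = np.random.normal(0.0, sigma, size=theta.shape)   # small Gaussian perturbation
    theta = theta - alpha * grad_J(theta) + noise            # noisy update can help escape shallow minima and saddles

print(theta)  # near [0, 0], up to noise of scale ~ sigma
```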

Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

Gradient Free Optimization
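
A deliberately simplified evolution-strategy sketch in the spirit of CMA-ES, but without the covariance adaptation: it only uses function values, never gradients. The objective and hyperparameters are illustrative assumptions.

```python
import numpy as np

def f(x):
    # Objective to minimize; only function values are used below, no gradients (illustrative).
    return x[0] ** 2 + 10.0 * x[1] ** 2

mean = np.array([3.0, 2.0])    # mean of the search distribution
sigma = 1.0                    # global step size
n_samples, n_elite = 20, 5

for _ in range(50):
    # Sample candidates around the current mean (full CMA-ES would also adapt a covariance matrix).
    candidates = mean + sigma * np.random.randn(n_samples, 2)
    scores = np.array([f(c) for c in candidates])
    elite = candidates[np.argsort(scores)[:n_elite]]   # keep the best candidates
    mean = elite.mean(axis=0)                          # move the mean toward them
    sigma *= 0.9                                       # shrink the search radius

print(mean)  # close to the minimizer [0, 0]
```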

