$$\mathbf{x} = \mathbf{x}_0 - \left[\nabla_{\mathbf{x}}^2 f(\mathbf{x}_0)\right]^{-1} \nabla_\mathbf{x} f(\mathbf{x}_0)$$
Intuition: the inverse Hessian rescales the gradient $\Rightarrow$ adaptive steps
The Hessian matrix captures whether the contours are circles or ellipses; multiplying by the inverse Hessian squishes elliptical contours into circles.
(put picture of circular contours vs. elliptical contours)
Curvature in direction $v$ is: $v^T \nabla_x^2 f(x)\, v$
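As a minimal NumPy sketch (the quadratic objective, its values, and the starting point below are illustrative assumptions, not from the lecture), one Newton step solves the curvature-scaled system directly:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x - b^T x, so the gradient and
# Hessian are known in closed form (A and b are made-up values).
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])      # elongated, elliptical contours
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b            # gradient of the quadratic

def hessian(x):
    return A                    # constant Hessian for a quadratic

x0 = np.array([5.0, 5.0])

# One Newton step: x = x0 - H^{-1} grad(x0); for a quadratic it jumps straight
# to the minimizer A^{-1} b, regardless of how stretched the contours are.
x1 = x0 - np.linalg.solve(hessian(x0), grad(x0))
print(x1)

# Curvature along a direction v is v^T H v.
v = np.array([1.0, 0.0])
print(v @ hessian(x0) @ v)      # 10.0: much higher curvature along the first axis
```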
Types of local minima:
Deep networks do not have a single global optimum
Highly non-convex optimization problem
Optimization often gets stuck at local minima, plateaus, or saddle points
Stochastic Gradient Descent
Mini-Batch Gradient Descent
Intuition:
In the first case (full-batch gradient descent), we update the parameters only after visiting all the samples.
In the second case (stochastic gradient descent), we update the parameters by computing the derivative using only one sample. The idea is that, on average, the computed derivatives should be (roughly) the same.
Mini-batch gradient descent is a trade-off between accurate derivatives and speed of computation.
Almost always, we should use mini-batches in practice.
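A minimal mini-batch SGD sketch on synthetic linear-regression data (the data, batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

# Mini-batch SGD on a synthetic linear-regression problem; all values here
# (data size, batch size, learning rate, epochs) are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
alpha, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))                  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient on the mini-batch only
        w -= alpha * grad                           # one update per mini-batch

print(np.linalg.norm(w - true_w))                   # should be small
```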
Idea: remember the previous direction (momentum)
$$v_t = \gamma v_{t-1} + \alpha \nabla_\theta J(\theta_{t-1})$$
$$\theta_t = \theta_{t-1} - v_t$$
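A sketch of the two update equations above on a toy quadratic (the objective, $\gamma$, and $\alpha$ values are illustrative assumptions):

```python
import numpy as np

# Momentum update from the equations above on a toy objective J(theta) = ||theta||^2;
# gamma and alpha are illustrative choices.
def grad_J(theta):
    return 2 * theta                           # gradient of ||theta||^2

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
gamma, alpha = 0.9, 0.05

for t in range(100):
    v = gamma * v + alpha * grad_J(theta)      # v_t = gamma * v_{t-1} + alpha * grad
    theta = theta - v                          # theta_t = theta_{t-1} - v_t

print(theta)                                   # close to the minimum at the origin
```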
In Nesterov's method, the order in which we move and take the derivative is different:
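The standard form of the Nesterov update (written here for completeness; the gradient is evaluated at the look-ahead point rather than at the current parameters):

$$v_t = \gamma v_{t-1} + \alpha \nabla_\theta J(\theta_{t-1} - \gamma v_{t-1})$$
$$\theta_t = \theta_{t-1} - v_t$$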
Nesterov showed that this method converges as quickly as a second-order approximation.
Comparing the number of iterations:
If $\epsilon = \|f(x) - f(x^*)\|$, where $x^*$ is the true optimum:
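For smooth convex objectives, the textbook comparison is that plain gradient descent needs on the order of $O(1/\epsilon)$ iterations to reach accuracy $\epsilon$, while Nesterov's accelerated method needs only $O(1/\sqrt{\epsilon})$.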
We need a different learning rate for each parameter.
Idea: slow down parameters that change quickly, and speed up parameters that change slowly.
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}}\, \nabla_\theta J(\theta_t)_i$$
where $G_t$ is a diagonal matrix whose $i$-th diagonal entry accumulates the squared gradients of parameter $i$: $G_{t,ii} = G_{t-1,ii} + \left(\nabla_\theta J(\theta_t)_i\right)^2$
In vector form (all operations element-wise):
$$\theta_{t+1} = \theta_{t} - \frac{\alpha}{\sqrt{G_{t} + \epsilon}} \nabla_\theta J(\theta_{t})$$
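A per-parameter sketch of the update above (the toy gradient, $\alpha$, and $\epsilon$ values are illustrative assumptions):

```python
import numpy as np

# Adagrad: each parameter gets its own effective learning rate alpha / sqrt(G_ii).
# The toy gradient (one fast-changing, one slow-changing parameter), alpha and
# eps are illustrative assumptions.
def grad_J(theta):
    return np.array([10.0, 0.1]) * theta       # parameter 0 changes fast, parameter 1 slowly

theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)                        # diagonal of G: accumulated squared gradients
alpha, eps = 0.5, 1e-8

for t in range(100):
    g = grad_J(theta)
    G += g ** 2                                 # G_{t,ii} = G_{t-1,ii} + g_i^2
    theta -= alpha / np.sqrt(G + eps) * g       # per-parameter step size

print(theta)                                    # note: G only grows, so steps keep shrinking
```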
Now we have a new problem: since $G_t$ only accumulates, the effective learning rate keeps shrinking and eventually becomes very small.
The difference between RMSProp and Adadelta is that the RMSProp update does not have the same units as the parameters; Adadelta tries to fix that.
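A minimal RMSProp-style sketch of that fix, assuming the usual exponentially decaying average with decay rate $\rho$ (all values illustrative); the comment notes what Adadelta changes:

```python
import numpy as np

# RMSProp replaces the ever-growing sum G with an exponential moving average of
# squared gradients, so the effective learning rate no longer shrinks to zero.
# rho, alpha, eps and the toy gradient are illustrative assumptions.
def grad_J(theta):
    return np.array([10.0, 0.1]) * theta

theta = np.array([1.0, 1.0])
E_g2 = np.zeros_like(theta)                      # running average of squared gradients
rho, alpha, eps = 0.9, 0.01, 1e-8

for t in range(500):
    g = grad_J(theta)
    E_g2 = rho * E_g2 + (1 - rho) * g ** 2       # decaying average instead of a sum
    theta -= alpha / np.sqrt(E_g2 + eps) * g     # Adadelta further replaces alpha with a
                                                 # running average of past squared updates,
                                                 # which restores the parameter units

print(theta)                                     # both parameters end up near the origin
```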