Seminar 14

Newton method

Reminder

  1. Descent methods
  2. Descent directions
  3. Gradient descent
  4. Step size selection rules
  5. Convergence theorem
  6. Experiments

Drawbacks of gradient descent

  • Linear convergence
  • Dependence on the condition number

Can we fix both of them?

Idea of Newton method

Consider the problem

$$ \min\limits_{x\ \in \mathbb{R}^n} f(x). $$
  • Gradient descent $\equiv$ linear approximation of $f$
  • Newton method $\equiv$ quadratic approximation of $f$:
$$ f(x + h) \approx f(x) + \langle f'(x), h \rangle + \frac{1}{2}h^{\top}f''(x)h \to \min_{h} $$

From the first-order necessary condition it follows that

$$ f'(x) + f''(x) h = 0, \qquad h^* = -(f''(x))^{-1} f'(x) $$

Is the obtained direction a descent direction?

Check the sign of the scalar product $\langle f'(x), h^* \rangle$:

$$ \langle f'(x), h^* \rangle = -(f')^{\top}(x) (f''(x))^{-1} f'(x) < 0 \Leftarrow f''(x) \succ 0 $$

Q: what if the Hessian becomes indefinite at some iteration $k$?
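A minimal numerical illustration of both points (the gradient vector and the toy Hessians below are arbitrary values chosen for this sketch): with a positive definite Hessian the Newton direction has a negative scalar product with the gradient, while an indefinite Hessian can produce a direction that is not a descent direction.

import numpy as np

g = np.ones(5)                                 # stand-in for f'(x)
H_pd = np.diag([1., 2., 3., 4., 5.])           # positive definite "Hessian"
H_indef = np.diag([-0.1, 1., 1., 1., 1.])      # indefinite "Hessian": one negative eigenvalue

for name, H in [("positive definite", H_pd), ("indefinite", H_indef)]:
    h = -np.linalg.solve(H, g)                 # Newton direction h* = -H^{-1} g
    print(f"{name:18} <f'(x), h*> = {g @ h:.2f}")
# Positive definite case: the product is negative, so h* is a descent direction.
# Indefinite case: here the product is positive, so h* is not a descent direction.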

Newton method

  • Classical Newton method: $\alpha_k \equiv 1$
  • Damped Newton method: $\alpha_k$ is selected at every iteration according to a given rule
def NewtonMethod(f, x0, epsilon, **kwargs):
    x = x0
    while True:
        h = ComputeNewtonStep(x, f, **kwargs)
        if StopCriterion(x, f, h, **kwargs) < epsilon:
            break
        alpha = SelectStepSize(x, h, f, **kwargs)
        x = x + alpha * h
    return x

Convergence theorem (Y. E. Nesterov Introduction to convex optimization, $\S$ 1.2)

Theorem. Assume that

  • $f(x)$ is twice differentiable and its Hessian is Lipschitz continuous with constant $M$
  • there exists a local minimizer $x^*$ at which the Hessian is positive definite
$$ f''(x^*) \succeq l\mathbf{I}, \; l > 0 $$
  • the starting point $x_0$ is sufficiently close to the minimizer
$$ \|x_0 - x^*\|_2 \leq \frac{2l}{3M} $$

Then the Newton method converges quadratically:

$$ \|x_{k+1} - x^* \|_2 \leq \dfrac{M\|x_k - x^*\|^2_2}{2 (l - M\|x_k - x^*\|_2)} $$
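To see the quadratic rate in practice, here is a minimal one-dimensional illustration; the test function $f(t) = e^t - t$ with minimizer $t^* = 0$ is an arbitrary choice for this sketch, not part of the seminar.

import numpy as np

# f(t) = exp(t) - t, so f'(t) = exp(t) - 1 and f''(t) = exp(t); the minimizer is t* = 0
t = 1.0
for k in range(6):
    t = t - (np.exp(t) - 1.0) / np.exp(t)      # classical Newton step, alpha = 1
    print(f"iteration {k + 1}: |t - t*| = {abs(t):.2e}")
# The error roughly squares at every iteration until machine precision is reached.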

Example

Use Newton's method to find a root of the following function

$$ \varphi(t) = \dfrac{t}{\sqrt{1+t^2}} $$

and determine for which starting points $t_0$ the iteration converges (see the sketch below).
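One possible numerical check (a sketch, not part of the seminar materials): for this $\varphi$ the Newton root-finding update $t_{k+1} = t_k - \varphi(t_k)/\varphi'(t_k)$ simplifies to $t_{k+1} = -t_k^3$, since $\varphi'(t) = (1+t^2)^{-3/2}$, which suggests convergence exactly for $|t_0| < 1$.

import numpy as np

phi = lambda t: t / np.sqrt(1 + t**2)
dphi = lambda t: (1 + t**2) ** (-1.5)

def newton_root(t0, num_iter=6):
    t = t0
    for _ in range(num_iter):
        t = t - phi(t) / dphi(t)               # equivalent to t = -t**3
    return t

for t0 in [0.5, 0.99, 1.0, 1.01, 1.5]:
    print(f"t0 = {t0:5.2f} -> t_6 = {newton_root(t0):.3e}")
# |t0| < 1: converges to the root t* = 0; |t0| = 1: oscillates between +1 and -1;
# |t0| > 1: the iterates blow up.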

Affine invariance

Consider a function $f(x)$ and a non-singular transformation with matrix $A$.

Let us check how the Newton direction changes under the transformation $A$.

Let $x = Ay$ and $g(y) = f(Ay)$. Then

$$ g(y + u) \approx g(y) + \langle g'(y), u \rangle + \frac{1}{2} u^{\top} g''(y) u \to \min_{u} $$

and

$$ u^* = -(g''(y))^{-1} g'(y) \qquad y_{k+1} = y_k - (g''(y_k))^{-1} g'(y_k) $$

or

\begin{align*} y_{k+1} & = y_k - (A^{\top}f''(Ay_k)A)^{-1} A^{\top}f'(Ay_k)\\ & = y_k - A^{-1}(f''(Ay_k))^{-1}f'(Ay_k) \end{align*}

Thus,

$$ Ay_{k+1} = Ay_k - (f''(Ay_k))^{-1}f'(Ay_k) \quad x_{k+1} = x_k - (f''(x_k))^{-1}f'(x_k) $$

Therefore, the direction produced by the Newton method transforms in the same way as the coordinates!
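A minimal numerical check of this property; the test function $f(x) = \sum_i e^{x_i} + \frac{1}{2}\|x\|_2^2$ and the random matrix below are arbitrary choices for this sketch. The Newton step $u^*$ computed for $g(y) = f(Ay)$ satisfies $Au^* = h^*$, where $h^*$ is the Newton step for $f$ at $x = Ay$.

import numpy as np

np.random.seed(0)
n = 4
A = np.random.randn(n, n) + 3 * np.eye(n)      # transformation matrix (non-singular with probability one)

# test function f(x) = sum(exp(x)) + 0.5 * ||x||^2 with explicit gradient and Hessian
grad_f = lambda x: np.exp(x) + x
hess_f = lambda x: np.diag(np.exp(x)) + np.eye(n)

y = np.random.randn(n)
x = A @ y

h = -np.linalg.solve(hess_f(x), grad_f(x))     # Newton step for f at x
grad_g = A.T @ grad_f(A @ y)                   # g'(y) = A^T f'(Ay)
hess_g = A.T @ hess_f(A @ y) @ A               # g''(y) = A^T f''(Ay) A
u = -np.linalg.solve(hess_g, grad_g)           # Newton step for g at y

print(np.allclose(A @ u, h))                   # True: the Newton direction is affine invariant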

Newton method with Hessian modification

  • How to deal with a possibly non-positive-definite Hessian at some iteration?
  • If $f''(x)$ is not positive definite, use the positive definite matrix $f''(x) + \Delta E$
  • The matrix $\Delta E$ can be chosen in different ways by solving the following problem $$ \Delta E = \arg\min \|\Delta E\|, \quad \text{s.t. } f''(x) + \Delta E \succ 0 $$

    • $\|\cdot\|_2$: $\Delta E = \tau I$, where $\tau = \max(0, \delta - \lambda_{\min}(f''(x)))$ and $\delta > 0$ is a given lower bound on the minimal eigenvalue of the modified matrix $f''(x) + \Delta E$ (see the sketch after this list)
    • What is $\Delta E$ if one uses $\|\cdot\|_F$?
  • Since the eigenvalues $\lambda(f''(x))$ are usually unavailable at every iteration, one can modify the Cholesky factorization algorithm so that it produces a factor of the matrix $f''(x) + \Delta E$ instead of the original matrix $f''(x)$
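A minimal sketch of the spectral ($\|\cdot\|_2$) modification; the indefinite test matrix, the gradient, and the value $\delta = 10^{-2}$ below are arbitrary choices for illustration.

import numpy as np

def modify_hessian(H, delta=1e-2):
    # tau = max(0, delta - lambda_min(H)) so that lambda_min(H + tau * I) >= delta
    lam_min = np.linalg.eigvalsh(H).min()      # eigvalsh: eigenvalues of a symmetric matrix
    tau = max(0.0, delta - lam_min)
    return H + tau * np.eye(H.shape[0])

H = np.diag([-1., 0.5, 2.])                    # indefinite "Hessian"
g = np.ones(3)                                 # stand-in for f'(x)

H_mod = modify_hessian(H)
h = -np.linalg.solve(H_mod, g)                 # Newton-like direction with the modified Hessian
print("lambda_min(H + dE) =", np.linalg.eigvalsh(H_mod).min())
print("<f'(x), h> =", g @ h)                   # negative: the descent property is restored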

Computational complexity and experiments

Bottlenecks in the Newton method:

  • forming and storing the Hessian
  • solving the linear system (a Cholesky-based sketch follows the equation below)
$$ f''(x_k)h = -f'(x_k) $$
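When $f''(x_k)$ is positive definite, this system is typically solved with a Cholesky factorization rather than a general LU decomposition, since it exploits symmetry and requires roughly half the flops. A minimal sketch (the matrix and right-hand side are arbitrary stand-ins):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

np.random.seed(0)
n = 5
B = np.random.randn(n, n)
H = B @ B.T + n * np.eye(n)                    # symmetric positive definite stand-in for f''(x_k)
g = np.random.randn(n)                         # stand-in for f'(x_k)

c, low = cho_factor(H)                         # Cholesky factorization, ~n^3/3 flops
h = cho_solve((c, low), -g)                    # Newton direction from f''(x_k) h = -f'(x_k)

print(np.allclose(H @ h, -g))                  # True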

Test problem

Recall the problem of finding the analytic center of the inequality system $Ax \leq 1$ subject to $|x_i| \leq 1$:

$$ f(x) = - \sum_{i=1}^m \log(1 - a_i^{\top}x) - \sum\limits_{i = 1}^n \log (1 - x^2_i) \to \min_x $$

$$ f'(x) - ? \quad f''(x) - ? $$
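For reference, differentiating term by term gives the following expressions; they match the implementation below, but verify them as an exercise.

$$ f'(x) = \sum_{i=1}^m \frac{a_i}{1 - a_i^{\top}x} + \left(\frac{2x_j}{1 - x_j^2}\right)_{j=1}^n, \qquad f''(x) = \sum_{i=1}^m \frac{a_i a_i^{\top}}{(1 - a_i^{\top}x)^2} + \mathrm{diag}\left(\frac{2(1 + x_j^2)}{(1 - x_j^2)^2}\right) $$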


In [1]:
import numpy as np

USE_COLAB = False
if USE_COLAB:
    !pip install git+https://github.com/amkatrutsa/liboptpy
        
import liboptpy.unconstr_solvers as methods
import liboptpy.step_size as ss

n = 1000
m = 200
x0 = np.zeros((n,))
A = np.random.rand(n, m) * 10

Exact solution with CVXPy


In [2]:
import cvxpy as cvx
x = cvx.Variable((n, 1))

obj = cvx.Minimize(cvx.sum(-cvx.log(1 - A.T * x)) - 
                   cvx.sum(cvx.log(1 - cvx.square(x))))
prob = cvx.Problem(obj)
prob.solve(solver="SCS", verbose=True, max_iters=1000)
print("Optimal value =", prob.value)


----------------------------------------------------------------------------
	SCS v1.2.6 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012-2016
----------------------------------------------------------------------------
Lin-sys: sparse-indirect, nnz in A = 205200, CG tol ~ 1/iter^(2.00)
eps = 1.00e-03, alpha = 1.50, max_iters = 1000, normalize = 1, scale = 1.00
Variables n = 3200, constraints m = 6600
Cones:	soc vars: 3000, soc blks: 1000
	exp vars: 3600, dual exp vars: 0
Setup time: 1.23e-02s
----------------------------------------------------------------------------
 Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0|      inf       inf       nan      -inf      -inf       inf  3.26e-02 
   100| 1.62e+01  3.27e+00  9.82e-03 -8.79e+03 -8.62e+03  1.82e-13  1.30e+00 
   200| 1.44e+00  9.74e-01  1.70e-03 -1.94e+03 -1.93e+03  3.25e-13  2.60e+00 
   300| 4.54e-01  5.36e-01  2.15e-04 -1.60e+03 -1.60e+03  3.44e-13  3.76e+00 
   400| 2.20e-01  2.78e-01  1.06e-04 -1.47e+03 -1.47e+03  3.45e-13  5.08e+00 
   500| 1.30e-01  1.30e-01  1.62e-04 -1.41e+03 -1.41e+03  3.43e-13  6.70e+00 
   600| 7.59e-02  4.81e-02  1.24e-04 -1.38e+03 -1.39e+03  3.40e-13  8.32e+00 
   700| 3.82e-02  9.90e-03  6.79e-05 -1.37e+03 -1.37e+03  3.38e-13  9.85e+00 
   800| 1.61e-02  3.15e-03  2.64e-05 -1.37e+03 -1.37e+03  3.37e-13  1.13e+01 
   900| 7.50e-03  4.82e-03  5.41e-06 -1.37e+03 -1.37e+03  3.37e-13  1.28e+01 
  1000| 6.01e-03  3.14e-03  1.71e-06 -1.37e+03 -1.37e+03  3.37e-13  1.43e+01 
----------------------------------------------------------------------------
Status: Solved/Inaccurate
Hit max_iters, solution may be inaccurate
Timing: Solve time: 1.43e+01s
	Lin-sys: avg # CG iterations: 3.41, avg solve time: 4.76e-03s
	Cones: avg projection time: 9.37e-03s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 7.0658e-04, dist(y, K*) = 1.1102e-16, s'y/|s||y| = -1.5183e-11
|Ax + s - b|_2 / (1 + |b|_2) = 6.0093e-03
|A'y + c|_2 / (1 + |c|_2) = 3.1396e-03
|c'x + b'y| / (1 + |c'x| + |b'y|) = 1.7108e-06
----------------------------------------------------------------------------
c'x = -1367.4802, -b'y = -1367.4755
============================================================================
Optimal value = -1367.4801856097865

Auxiliary functions


In [3]:
f = lambda x: -np.sum(np.log(1 - A.T.dot(x))) - np.sum(np.log(1 - x*x))
grad_f = lambda x: np.sum(A.dot(np.diagflat(1 / (1 - A.T.dot(x)))), axis=1) + 2 * x / (1 - np.power(x, 2))
hess_f = lambda x: (A.dot(np.diagflat(1 / (1 - A.T.dot(x))**2))).dot(A.T) + np.diagflat(2 * (1 + x**2) / (1 - x**2)**2)

Implementation of Newton method


In [4]:
def Newton(f, gradf, hessf, x0, epsilon, num_iter, line_search, 
           disp=False, callback=None, **kwargs):
    x = x0.copy()
    iteration = 0
    opt_arg = {"f": f, "grad_f": gradf}
    for key in kwargs:
        opt_arg[key] = kwargs[key]
    while True:
        gradient = gradf(x)
        hess = hessf(x)
        # Newton direction: solve f''(x) h = -f'(x)
        h = np.linalg.solve(hess, -gradient)
        # Damped Newton: the step size is chosen by the given line search rule
        alpha = line_search(x, h, **opt_arg)
        x = x + alpha * h
        if callback is not None:
            callback(x)
        iteration += 1
        if disp:
            print("Current function val =", f(x))
            print("Current gradient norm = ", np.linalg.norm(gradf(x)))
        # Stopping criteria: small gradient norm or iteration limit
        if np.linalg.norm(gradf(x)) < epsilon:
            break
        if iteration >= num_iter:
            break
    res = {"x": x, "num_iter": iteration, "tol": np.linalg.norm(gradf(x))}
    return res

Comparison with gradient descent


In [5]:
newton = methods.so.NewtonMethod(f, grad_f, hess_f, ss.Backtracking("Armijo", rho=0.9, beta=0.1, init_alpha=1.))
x_newton = newton.solve(x0, tol=1e-6, max_iter=50, disp=True)

gd = methods.fo.GradientDescent(f, grad_f, ss.Backtracking("Armijo", rho=0.9, beta=0.1, init_alpha=1.))
x_gd = gd.solve(x0, tol=1e-6, max_iter=50, disp=True)


Required tolerance achieved!
Convergence in 14 iterations
Function value = -1368.6652957529736
Norm of gradient = 7.596291750182202e-10
Required tolerance achieved!
Convergence in 47 iterations
Function value = -1368.6652957529739
Norm of gradient = 3.0014120117562214e-09
/Users/alex/anaconda3/envs/cvxpy/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
  """Entry point for launching an IPython kernel.

In [6]:
%matplotlib inline
import matplotlib.pyplot as plt

if not USE_COLAB:
    plt.rc("text", usetex=True)
    
plt.figure(figsize=(12, 8))
# Newton
plt.semilogy([np.linalg.norm(grad_f(x)) for x in newton.get_convergence()], label="$\| f'(x_k) \|^{N}_2$")
# Gradient
plt.semilogy([np.linalg.norm(grad_f(x)) for x in gd.get_convergence()], label="$\| f'(x_k) \|^{G}_2$")
plt.xlabel(r"Number of iterations, $k$", fontsize=26)
plt.ylabel(r"Convergence rate", fontsize=26)
plt.xticks(fontsize = 24)
plt.yticks(fontsize = 24)
plt.legend(loc="best", fontsize=24)


Out[6]:
<matplotlib.legend.Legend at 0xd2456fe80>

Comparison of running time


In [7]:
%timeit newton.solve(x0, tol=1e-6, max_iter=50)
%timeit gd.solve(x0, tol=1e-6, max_iter=50)


612 ms ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
/Users/alex/anaconda3/envs/cvxpy/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
  """Entry point for launching an IPython kernel.
256 ms ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Newton method gives a very accurate solution, but takes more time
  • Gradient descent gives a less accurate solution, but runs much faster
  • Sometimes an inaccurate solution is not a critical issue, so gradient descent can be the method of choice

Pro & Contra

Pro

  • quadratic convergence near the solution
  • highly accurate solution
  • affine invariance

Contra

  • the Hessian has to be stored at every iteration: $O(n^2)$ memory
  • a linear system has to be solved at every iteration: $O(n^3)$ flops in the general case
  • the Hessian can be singular or ill-conditioned

Summary of comparison with GD (B. T. Polyak, Introduction to Optimization, Ch. 3, $\S$ 1)

Method | Convergence speed | Complexity | Affine invariance | Restrictions on $f(x)$
GD | Global linear | $O(n)$ + step size search | No | Differentiable, Lipschitz gradient
Newton method | Local quadratic | $O(n^3)$ + step size search | Yes | Twice differentiable; Lipschitz and positive definite Hessian

What's next?

  • Complexity: how to deal with solving linear systems and storing the Hessian?
  • Convergence: how to combine local quadratic and global linear convergence? Ideally we would like global quadratic convergence!
  • Restrictions on $f(x)$ should be relaxed
  • Quasi-Newton methods often address these issues

Recap

  1. Newton method
  2. Convergence theorem
  3. Comparison with gradient descent
  4. Experiments