Quasi Newton methods

Newton method vs. GD (B.T. Polyak, Introduction to Optimization, Ch. 3, $\S$ 1)

| Method | Convergence speed | Complexity per iteration | Affine invariance | Restrictions on $f(x)$ |
|---|---|---|---|---|
| GD | Global linear | $O(n)$ + step size search | No | Differentiable, Lipschitz gradient |
| Newton method | Local quadratic | $O(n^3)$ + step size search | Yes | Twice differentiable; Lipschitz and positive definite hessian |

How to reduce computational and memory complexity?

  • Computational complexity can be reduced with

    • quasi Newton methods, aka variable metric methods
    • they require storing an $n \times n$ matrix
  • Computational and memory complexity can be reduced with

    • limited memory quasi Newton methods, e.g. L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
    • they do not require storing any matrix
    • instead of a matrix they require storing $k \ll n$ vectors from $\mathbb{R}^n$

A unified approach to obtain Newton's method and gradient descent

  • Gradient descent is obtained from the linear approximation (or quadratic upper bound):
$$ f_G(x) \approx f(y) + \langle f'(y), x - y \rangle + \frac{1}{2}(x-y)^{\top} \frac{1}{\alpha}I(x - y) $$

where $\alpha \in (0, 1/L]$ and $f(x) \leq f_G(x)$, i.e. $f_G$ is a global upper bound on $f(x)$

  • Newton's method is obtained from the second-order approximation
$$ f_N(x) \approx f(y) + \langle f'(y), x - y \rangle + \frac{1}{2} (x-y)^{\top}f''(y)(x-y) $$
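
For reference, minimizing each surrogate over $x$ recovers the corresponding step:

$$ f_G'(x) = f'(y) + \frac{1}{\alpha}(x - y) = 0 \;\Rightarrow\; x = y - \alpha f'(y), \qquad f_N'(x) = f'(y) + f''(y)(x - y) = 0 \;\Rightarrow\; x = y - (f''(y))^{-1} f'(y) $$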

Idea: use an intermediate approximation of the form

$$ f_q(x) \approx f(y) + \langle f'(y), x - y \rangle + \frac{1}{2} (x-y)^{\top}{\color{red}{B(y)}}(x-y), $$

which gives the following rule for updating the current point:

$$ x_{k+1} = x_k - \alpha_k B^{-1}_k f'(x_k) = x_k - \alpha_k H_k f'(x_k) $$

Some history...

  • The first quasi Newton method was proposed by the physicist William Davidon in the mid-1950s to accelerate computations on unstable computers that crashed before the calculation was finished
  • His paper about this method was not accepted for publication and remained just a technical report for more than 30 years
  • It was finally published in 1991 in the first volume of SIAM Journal on Optimization

General scheme of quasi Newton methods

def QuasiNewtonMethod(f, grad_f, x0, epsilon, **kwargs):
    x = x0
    H = np.eye(x0.shape[0])  # initial approximation of the inverse hessian, H_0 = I
    g = grad_f(x)
    while True:
        h = -H.dot(g)  # quasi Newton direction
        if StopCriterion(x, f, h, **kwargs) < epsilon:
            break
        alpha = SelectStepSize(x, h, f, **kwargs)  # e.g. the Wolfe rule
        x_next = x + alpha * h
        g_next = grad_f(x_next)
        H = UpdateH(H, x_next - x, g_next - g)  # update from s_k and y_k (BFGS, DFP, ...)
        x, g = x_next, g_next
    return x

How to find $B_{k+1}$?

At the point $x_{k+1}$ the following relation holds:

$$ f_q(h) \approx f(x_{k+1}) + \langle f'(x_{k+1}), h \rangle + \frac{1}{2}h^{\top}B_{k+1}h $$

From the definition it follows that $B_{k+1} \in \mathbb{S}^n_{++}$. What are the natural restrictions for $f_q(h)$?

$$ f_q'(-\alpha_k h_k) = f'(x_k) \qquad f'_q(0) = f'(x_{k+1}), $$

where the first equation gives

$$ f'(x_{k+1}) - \alpha_k B_{k+1}h_k = f'(x_k), $$

and the second one holds by construction.

Secant equation

The first equation gives

$$ B_{k+1}s_k = y_k, $$

where $s_k = x_{k+1} - x_k$ and $y_k = f'(x_{k+1}) - f'(x_k)$.

This equation has a solution with $B_{k+1} \succ 0$ only if $s^{\top}_k y_k > 0$. Why?

Q: does the relation between vectors $s_k$ and $y_k$ always hold?

Hint: remember about Wolfe condition

Q: is matrix $B_{k+1}$ unique?

How to define $B_{k+1}$ uniquely?

\begin{align*} & \min_B \| B_k - B \| \\ \text{s.t. } & B = B^{\top}\\ & Bs_k = y_k \end{align*}

DFP (Davidon-Fletcher-Powell)

$$ B_{k+1} = (I - \rho_k y_k s^{\top}_k)B_k(I - \rho_k s_ky^{\top}_k) + \rho_k y_k y^{\top}_k, $$

where $\rho_k = \dfrac{1}{y^{\top}_k s_k}$,

or, applying the Sherman-Morrison-Woodbury formula,

$$ B^{-1}_{k+1} = H_{k+1} = H_k - \dfrac{H_ky_k y_k^{\top}H_k}{y^{\top}_kH_ky_k} + \dfrac{s_ks^{\top}_k}{y^{\top}_ks_k} $$

Q: what is the rank of the difference between $B_{k+1}$ ($H_{k+1}$) and $B_k$ ($H_k$)?
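
A minimal NumPy sketch of this update of the inverse hessian approximation $H$ (the function name and calling convention are illustrative, not the library code used below):

import numpy as np

def dfp_update_H(H, s, y):
    # DFP update of the inverse hessian approximation:
    # H_{k+1} = H_k - (H_k y y^T H_k) / (y^T H_k y) + (s s^T) / (y^T s)
    # assumes the curvature condition y^T s > 0
    Hy = H.dot(y)
    return H - np.outer(Hy, Hy) / y.dot(Hy) + np.outer(s, s) / y.dot(s)

Only matrix-vector and outer products are involved, so the update costs $O(n^2)$ operations.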

Summary

General idea of quasi Newton methods: instead of computing the complete hessian at every iteration, update the current hessian approximation with cheap, easily computed transformations.

BFGS

Q: what is the natural modification of the DFP method?

\begin{align*} & \min_H \| H_k - H \| \\ \text{s.t. } & H = H^{\top}\\ & Hy_k = s_k \end{align*}

Update equation for BFGS:

$$ H_{k+1} = (I - \rho_k s_ky^{\top}_k)H_k(I - \rho_k y_k s^{\top}_k) + \rho_k s_k s^{\top}_k, $$

where $\rho_k = \dfrac{1}{y^{\top}_k s_k}$
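
A minimal NumPy sketch of this update in an expanded form that uses only matrix-vector and outer products, so the cost stays $O(n^2)$ (the function name is illustrative):

import numpy as np

def bfgs_update_H(H, s, y):
    # BFGS update of the inverse hessian approximation, expanded form of
    # (I - rho s y^T) H (I - rho y s^T) + rho s s^T, where rho = 1 / (y^T s);
    # assumes the curvature condition y^T s > 0
    rho = 1.0 / y.dot(s)
    Hy = H.dot(y)
    return (H - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + rho * (rho * y.dot(Hy) + 1.0) * np.outer(s, s))

The expanded form is exactly what allows avoiding $O(n^3)$ operations, as discussed in the implementation details below.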

Implementation details

  • No operations with complexity $O(n^3)$: in particular, avoid matrix-by-matrix multiplication and solving linear systems (cf. the implementation in SciPy)
  • Only the Wolfe rule guarantees that $y_k^{\top}s_k > 0$
  • The parameters of the Wolfe rule are usually the following (see the line search sketch after this list)
    • $\alpha_0 = 1$ is necessary for superlinear convergence
    • $\beta_1 = 10^{-4}$, $\beta_2 = 0.9$
  • Initialization of $H_0$
    • identity matrix
    • $H_0 = \frac{y_0^{\top}s_0}{y_0^{\top}y_0}I$, set after the first step but before the update of $H_1$; the step to $x_1$ is computed with $H_0 = I$
    • $H_0 = \delta \|g_0\|^{-1}_2 I$, where the parameter $\delta$ has to be chosen in advance and $g_0$ is the gradient at the point $x_0$
  • When using $B$ instead of $H$, one has to store $B$ implicitly in the form of an $LDL^{\top}$ decomposition and update this decomposition rather than the matrix itself. This can be done in $O(n^2)$. Computing $h_k$ then amounts to solving a linear system with the factorized matrix, which also requires $O(n^2)$. This approach allows controlling the stability of the method via the diagonal entries of $D$. In practice, the standard choice is to work with $H$
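
As an illustration of the Wolfe rule mentioned in the list above, here is a minimal sketch based on scipy.optimize.line_search, whose defaults $c_1 = 10^{-4}$, $c_2 = 0.9$ match the parameters above (the helper wolfe_step and its fallback value are illustrative, not part of the lecture code):

from scipy.optimize import line_search

def wolfe_step(f, grad_f, x, h):
    # Wolfe line search along the quasi Newton direction h from the point x
    alpha, *_ = line_search(f, grad_f, x, h, c1=1e-4, c2=0.9)
    if alpha is None:  # the search may fail; fall back to a small fixed step
        alpha = 1e-3
    return alpha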

Convergence

Theorem

Let $f$ be twice continuously differentiable with a Lipschitz hessian. Also assume that the sequence generated by the BFGS method converges to a minimizer $x^*$ such that

$$ \sum_{k=1}^{\infty} \|x_k - x^*\| < \infty. $$

Then $x_k \to x^*$ superlinearly.

Self-correcting property

  • If BFGS gives a poor hessian approximation at some iteration, then after a few iterations this flaw will be fixed automatically, i.e. the method corrects itself
  • This property holds only with an appropriate step size selection, for example with the Wolfe rule
  • The DFP method is significantly worse at correcting a poor hessian approximation
  • This property is illustrated in the experiments below

Limited-memory BFGS (L-BFGS)

  • BFGS does not require the matrix $H$ itself, only a function that computes the product of this matrix with the antigradient
  • Since we only need a locally accurate hessian approximation at every iteration, old vectors $s$ and $y$ can decrease the quality of the hessian approximation

Idea

  • Store the last $k \ll n$ vectors $s$ and $y$: memory reduction from $n^2$ to $kn$
  • Compute the matrix-vector product with the two-loop recursion, without explicitly forming the matrix $H$ (see the sketch below)
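
A minimal sketch of the two-loop recursion, assuming the gradient and the stored vectors are NumPy arrays (the function name and calling convention are illustrative; s_list and y_list store the last $k$ pairs $s_i$, $y_i$, oldest first):

def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    # Returns -H_k.dot(grad) without forming H_k; gamma scales the initial H_k^0 = gamma * I
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: from the newest pair to the oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * s.dot(q)
        alphas.append(alpha)
        q = q - alpha * y
    r = gamma * q  # apply the initial approximation H_k^0
    # second loop: from the oldest pair to the newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * y.dot(r)
        r = r + (alpha - beta) * s
    return -r

Each call costs $O(kn)$ operations and $O(kn)$ memory instead of the $O(n^2)$ needed to store $H$ explicitly.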

Relationship with non-linear conjugate gradient method

  • In the Hestenes-Stiefel method
$$ h_{k+1} = -f'(x_{k+1}) + \beta_{k+1} h_{k}, \quad \beta_{k+1} = \frac{y_k^{\top}f'(x_{k+1})}{y_k^{\top} h_k} $$

or

$$ h_{k+1} = -\left(I - \frac{s_k y_k^{\top}}{y_k^{\top}s_k}\right)f'(x_{k+1}) = -\hat{H}_{k+1} f'(x_{k+1}) $$
  • Matrix $\hat{H}_{k+1}$ is nonsymmetric and not positive definite, but matrix
$$ H_{k+1} = \left(I - \frac{s_k y_k^{\top}}{y_k^{\top}s_k}\right)\left(I - \frac{y_k s_k^{\top}}{y_k^{\top}s_k}\right) + \frac{s_ks_k^{\top}}{y_k^{\top}s_k} $$

satisfies all requirements on the matrix in BFGS, and this expression coincides with the BFGS update of $H_k$ when $H_k = I$, i.e. $k=1$ in L-BFGS with $H_0 = I$

  • Moreover, if the step size is selected by exact line search (as in steepest descent), the Hestenes-Stiefel formula and the L-BFGS formula for $k=1$ coincide

Barzilai-Borwein method

  • The first paper about this method was published in 1988 in IMA Journal of Numerical Analysis
  • A paper from NIPS 2016 proposed a modification of it for the case of stochastic gradient estimates
  • Idea: combination of steepest descent method and quasi Newton method

Method idea

  • Steepest descent: $x_{k+1} = x_k - \alpha_k f'(x_k)$, $\alpha_k = \arg \min\limits_{\alpha > 0} f(x_k - \alpha f'(x_k))$
  • Newton method: $x_{k+1} = x_k - (f''(x_k))^{-1} f'(x_k)$
  • Approximate the hessian with a scaled identity matrix
$$ \alpha_k f'(x_k) = \alpha_k I f'(x_k) = \left( \frac{1}{\alpha_k} I \right)^{-1} f'(x_k) \approx (f''(x_k))^{-1} f'(x_k) $$
  • How to find $\alpha_k$?

Secant equation again

  • For the exact hessian
$$ f''(x_{k})(x_{k} - x_{k-1}) \approx f'(x_{k}) - f'(x_{k-1}) $$
  • For the approximate hessian
$$ \alpha_k^{-1} s_{k-1} \approx y_{k-1} $$
  • This is the problem of approximating one vector by a scaling of another vector
  • The simplest quasi Newton method thus reduces to a step size selection rule

Three ways to find $\alpha_k$

  • The first way

    • Problem

      $$ \min_{\beta} \|\beta s_{k-1} - y_{k-1} \|^2_2 $$

    • Solution

      $$ \alpha = \frac{1}{\beta} = \frac{s^{\top}_{k-1} s_{k-1}}{s^{\top}_{k-1} y_{k-1}} $$

  • The second way

    • Problem

      $$ \min_{\alpha} \| s_{k-1} - \alpha y_{k-1} \|^2_2 $$

    • Solution

      $$ \alpha = \frac{s^{\top}_{k-1} y_{k-1}}{y^{\top}_{k-1} y_{k-1}} $$

  • The third way is called non-monotone line search: a specific modification of the Armijo rule that takes into account the history of objective function values; see the 2004 paper in SIAM Journal on Optimization for more details
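
A minimal sketch computing both step sizes from the last step and gradient difference, assuming s and y are NumPy vectors (the function name is illustrative; cf. the bb_method implementation below):

def bb_step_sizes(s, y):
    # s = x_k - x_{k-1}, y = f'(x_k) - f'(x_{k-1})
    alpha_first = s.dot(s) / s.dot(y)   # the first way: alpha = 1 / beta
    alpha_second = s.dot(y) / y.dot(y)  # the second way
    return alpha_first, alpha_second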

Experiments

Analytical center of the linear inequality system

$$ f(x) = - \sum_{i=1}^m \log(1 - a_i^{\top}x) - \sum\limits_{i = 1}^n \log (1 - x^2_i) \to \min_x $$

In [1]:
import numpy as np
import liboptpy.unconstr_solvers as methods
import liboptpy.step_size as ss
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.optimize as scopt
plt.rc("text", usetex=True)

In [2]:
n = 3000
m = 100
x0 = np.zeros(n)
max_iter = 100
tol = 1e-5
A = np.random.rand(m, n) * 10

In [3]:
f = lambda x: -np.sum(np.log(1 - A.dot(x))) - np.sum(np.log(1 - x*x))
grad_f = lambda x: np.sum(A.T / (1 - A.dot(x)), axis=1) + 2 * x / (1 - np.power(x, 2))

In [4]:
def bb_method(f, gradf, x0, tol=1e-6, maxiter=100, callback=None, alpha_type=1):
    it = 0
    x_prev = x0.copy()
    current_tol = np.linalg.norm(gradf(x_prev))
    alpha = 1e-4  # small initial step size, used before any (s, y) pair is available
    while current_tol > tol and it < maxiter:
        it += 1
        current_grad = gradf(x_prev)
        if it != 1:
            g = current_grad - prev_grad  # gradient difference y_{k-1}
            if alpha_type == 1:
                alpha = g.dot(s) / g.dot(g)  # alpha = s^T y / y^T y
            elif alpha_type == 2:
                alpha = s.dot(s) / g.dot(s)  # alpha = s^T s / s^T y
        if callback:
            callback(x_prev)
        x_next = x_prev - alpha * current_grad
        current_tol = np.linalg.norm(gradf(x_next))
        prev_grad = current_grad
        s = x_next - x_prev  # step s_{k-1}
        x_prev = x_next
    if callback:
        callback(x_prev)
    return x_prev  # equals the last x_next; well-defined even if the loop never runs

In [5]:
method = {
    "BB 1": methods.fo.BarzilaiBorweinMethod(f, grad_f, init_alpha=1e-4, type=1),
    "BFGS": methods.fo.BFGS(f, grad_f),
    "DFP": methods.fo.DFP(f, grad_f),
    "LBFGS": methods.fo.LBFGS(f, grad_f),
}

In [6]:
for m in method:
    print("\t Method {}".format(m))
    _ = method[m].solve(x0=x0, tol=tol, max_iter=max_iter, disp=True)

print("\t Method BFGS Scipy")
scopt_conv = []
scopt_res = scopt.minimize(f, x0, method="BFGS", jac=grad_f, callback=lambda x: scopt_conv.append(x), 
                           tol=tol, options={"maxiter": max_iter})
print("Result: {}".format(scopt_res.message))
if scopt_res.success:
    print("Convergence in {} iterations".format(scopt_res.nit))
print("Function value = {}".format(f(scopt_res.x)))


	 Method BB 1
Required tolerance achieved!
Convergence in 10 iterations
Function value = -706.5952809623045
Norm of gradient = 5.34264130475429e-06
	 Method BFGS
/Users/alex/anaconda3/envs/cvxpy/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
  """Entry point for launching an IPython kernel.
Required tolerance achieved!
Convergence in 24 iterations
Function value = -706.5952809622966
Norm of gradient = 6.928479420684621e-06
	 Method DFP
Maximum iteration exceeds!
Convergence in 100 iterations
Function value = -706.5950469552579
Norm of gradient = 0.03140781536901585
	 Method LBFGS
Required tolerance achieved!
Convergence in 8 iterations
Function value = -706.5952809623017
Norm of gradient = 5.766967682893029e-06
	 Method BFGS Scipy
Result: Optimization terminated successfully.
Convergence in 16 iterations
Function value = -706.5952809618987

In [7]:
plt.figure(figsize=(8, 6))

for m in method:
    plt.semilogy([np.linalg.norm(grad_f(x)) for x in method[m].get_convergence()], label=m)

plt.semilogy([np.linalg.norm(grad_f(x)) for x in [x0] + scopt_conv], label="BFGS SciPy")
plt.ylabel("$\|f'(x_k)\|_2$", fontsize=18)
plt.xlabel("Number of iterations, $k$", fontsize=18)
plt.legend(fontsize=18)
plt.xticks(fontsize=18)
_ = plt.yticks(fontsize=18)



In [9]:
for m in method:
    print("\t Method {}".format(m))
    %timeit method[m].solve(x0=x0, tol=tol, max_iter=max_iter)

%timeit scopt.minimize(f, x0, method="BFGS", jac=grad_f, tol=tol, options={"maxiter": max_iter})


	 Method BB 1
6.03 ms ± 59.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
	 Method BFGS
/Users/alex/anaconda3/envs/cvxpy/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
  """Entry point for launching an IPython kernel.
5.82 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
	 Method DFP
/Users/alex/anaconda3/envs/cvxpy/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
  """Entry point for launching an IPython kernel.
10.2 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
	 Method LBFGS
/Users/alex/anaconda3/envs/cvxpy/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
  """Entry point for launching an IPython kernel.
28.5 ms ± 571 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
14.1 s ± 666 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ill-conditioned problem


In [8]:
n = 50
D = np.arange(1, n+1)
U = np.random.randn(n, n)
U, _ = np.linalg.qr(U)
A = U.dot(np.diag(D)).dot(U.T)
b = np.random.randn(n)
eig_vals = np.linalg.eigvals(A)
print("Condition number = {}".format(np.max(eig_vals) / np.min(eig_vals)))


Condition number = 50.000000000000334

In [9]:
f = lambda x: 0.5 * x.T.dot(A.dot(x)) - b.dot(x)
gradf = lambda x: A.dot(x) - b
x0 = np.random.randn(n)

In [21]:
method = {
    "BB 1": methods.fo.BarzilaiBorweinMethod(f, gradf, init_alpha=1e-4, type=1),
    "BB 2": methods.fo.BarzilaiBorweinMethod(f, gradf, init_alpha=1e-4, type=2),
    "BFGS": methods.fo.BFGS(f, gradf),
    "DFP": methods.fo.DFP(f, gradf),
    "GD": methods.fo.GradientDescent(f, gradf, ss.ExactLineSearch4Quad(A, b)),
    "LBFGS": methods.fo.LBFGS(f, gradf, hist_size=10),
}

In [22]:
for m in method:
    print("\t Method {}".format(m))
    _ = method[m].solve(x0=x0, tol=tol, max_iter=max_iter, disp=True)

print("\t Method BFGS Scipy")

scopt_conv = []
scopt_res = scopt.minimize(f, x0, method="BFGS", jac=gradf, callback=lambda x: scopt_conv.append(x), 
                           tol=tol, options={"maxiter": max_iter})
print("Result: {}".format(scopt_res.message))
if scopt_res.success:
    print("Convergence in {} iterations".format(scopt_res.nit))
print("Function value = {}".format(f(scopt_res.x)))


	 Method BB 1
Required tolerance achieved!
Convergence in 62 iterations
Function value = -2.310714124536439
Norm of gradient = 5.976552870133477e-06
	 Method BB 2
Required tolerance achieved!
Convergence in 73 iterations
Function value = -2.310714124510319
Norm of gradient = 7.645687986275001e-06
	 Method BFGS
Required tolerance achieved!
Convergence in 46 iterations
Function value = -2.3107141245352274
Norm of gradient = 8.228357596394661e-06
	 Method DFP
Required tolerance achieved!
Convergence in 97 iterations
Function value = -2.3107141245360507
Norm of gradient = 8.385720237398356e-06
	 Method GD
Maximum iteration exceeds!
Convergence in 100 iterations
Function value = -2.3101833234570632
Norm of gradient = 0.04463472408731432
	 Method LBFGS
Required tolerance achieved!
Convergence in 46 iterations
Function value = -2.3107141245355973
Norm of gradient = 7.694823416577808e-06
	 Method BFGS Scipy
Result: Optimization terminated successfully.
Convergence in 59 iterations
Function value = -2.310714124538122

In [23]:
plt.figure(figsize=(12, 8))
fontsize = 26
for m in method:   
    plt.semilogy([np.linalg.norm(gradf(x)) for x in method[m].get_convergence()], label=m)

plt.semilogy([np.linalg.norm(gradf(x)) for x in [x0] + scopt_conv], label='BFGS SciPy')
plt.legend(fontsize=fontsize)
plt.ylabel("$\|f'(x_k)\|_2$", fontsize=fontsize)
plt.xlabel("Number of iterations, $k$", fontsize=fontsize)
plt.xticks(fontsize=fontsize)
_ = plt.yticks(fontsize=fontsize)


Pro & Contra

Pro:

  1. Instead of exact computation of the hessian, one uses its estimate, which is obtained from the current hessian approximation, the difference between the last two points, and the corresponding gradients
  2. Instead of solving a linear system, we already have an estimate of the inverse hessian
  3. Complexity per iteration is $O(n^2) + ...$ compared with $O(n^3) + ...$ in Newton method
  4. L-BFGS has memory complexity linear in the problem dimension
  5. Self-correcting property of BFGS: if at some iteration the hessian approximation is very poor, the next few iterations will improve the quality of the approximation
  6. Superlinear convergence to the minimizer of the objective function $f$ (see [1] for more details)

Contra:

  1. No universal recipe for choosing $B_0$ and $H_0$
  2. No complete convergence and optimality theory
  3. Not every step size selection method ensures the curvature condition $y^{\top}_k s_k > 0$