Line search algorithms

In this notebook, we'll review line search algorithms in general. In particular, we'll

  1. discuss requirements for step lengths to result in global convergence
  2. discuss how to choose the step length via an approximate line search algorithm
  3. discuss global convergence and rate of convergence of line search algorithms
  4. walk through a few examples of functions and the result of these line search algorithms on these functions

Code for the algorithms and test functions is available in the following files:


In [4]:
include("../optimizers/linesearch.jl")
include("../utils/functions.jl")
include("../utils/plot.jl")

using Gadfly

Overview

A line search algorithm is one that, given a starting point $x_0$, produces iterates $x_{k+1}$ of the form:

$$x_{k+1} = x_k + \alpha_k p_k,$$

where $\alpha_k$ is the step length and $p_k$ is the step direction. Let $f_k := f(x_k)$.

Ideally, we want an algorithm that both results in:

  1. global convergence: i.e., the gradient norms converge to 0 $$ || \nabla f_k || \rightarrow 0 \text{ as } k\rightarrow \infty$$

  2. a fast rate of convergence

In practice, we terminate when the iterates "stop changing", i.e., when $||x_{k-1} - x_{k}|| < \epsilon$. But we can also plot, e.g.,

  1. the gradient norms $|| \nabla f_k ||$, to get a sense of global convergence, or
  2. the ratio of the gradient norms of the current and previous iterations $|| \nabla f_{k+1} ||/|| \nabla f_k ||$, to get a sense of the rate of convergence.
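
For concreteness, here is a minimal sketch of these diagnostics. The `compute_grad_ratio` helper used later in this notebook (defined in the included utility files) presumably computes something like the second quantity, though its exact implementation may differ.

using LinearAlgebra   # for `norm` (part of Base on older Julia versions)

# gradient norms ||∇f_k|| along a trajectory of iterates
grad_norms_sketch(grad, xvals) = [norm(grad(x)) for x in xvals]

# ratios ||∇f_{k+1}|| / ||∇f_k||: a limit in (0, 1) suggests linear convergence,
# a limit of 0 suggests superlinear convergence
grad_ratios_sketch(gnorms) = [gnorms[k+1] / gnorms[k] for k in 1:length(gnorms)-1]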

The two main ingredients are the choice of the step length and the step direction.

Choosing the step length $\alpha$

To choose the step length at iteration $k$, we could pick $\alpha$ to exactly minimize $f(x_k + \alpha p_k)$, but in practice this is too expensive, even though it is only a one-dimensional problem. Instead, we perform what is called an inexact line search, where all we need is an iterate $x_{k+1}$ that gives a sufficient reduction in the value of the function.

Wolfe conditions

In an inexact line search, we just need to make sure the step lengths are neither too long nor too short. If the step length $\alpha$ satisfies two conditions, known as the Wolfe conditions, then in practice we can usually do just as well as with an exact line search.

The two conditions are the sufficient decrease condition and the curvature condition.

  1. Sufficient decrease (Armijo condition): $$ f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f_k^\top p_k.$$
  2. Curvature condition: $$ \nabla f(x_k + \alpha p_k)^\top p_k \geq c_2 \nabla f_k^\top p_k.$$

There's also a modified curvature condition, which together with the sufficient decrease condition is known as the strong Wolfe conditions, in which $$ |\nabla f(x_k + \alpha p_k)^\top p_k| \leq c_2 |\nabla f_k^\top p_k|.$$

In both weak and strong cases, we require $0 < c_1 < c_2 < 1$.

In fact, under mild assumptions on the function (continuously differentiable and bounded below along the ray $x_k + \alpha p_k$, $\alpha > 0$), there always exist intervals of step lengths $\alpha$ satisfying the (weak or strong) Wolfe conditions.
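
For concreteness, here is a small sketch of how one might check these conditions for a candidate step length; `satisfies_wolfe` is a hypothetical helper, not part of the included line search code.

using LinearAlgebra  # for `dot`

# Check the weak (or strong) Wolfe conditions for a candidate step length α.
function satisfies_wolfe(f, g, x, p, α; c1=1e-4, c2=0.9, strong=false)
    ϕ0  = f(x)
    dϕ0 = dot(g(x), p)             # directional derivative at α = 0 (< 0 for a descent direction)
    xnew = x + α * p
    armijo = f(xnew) <= ϕ0 + c1 * α * dϕ0
    dϕα = dot(g(xnew), p)
    curvature = strong ? abs(dϕα) <= c2 * abs(dϕ0) : dϕα >= c2 * dϕ0
    return armijo && curvature
end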

Goldstein conditions

The Goldstein conditions are an alternative to the Wolfe conditions that also ensure step lengths are neither too long nor too short. They are $$ f(x_k) + (1-c) \alpha \nabla f_k^\top p_k \leq f(x_k + \alpha p_k) \leq f(x_k) + c \alpha \nabla f_k^\top p_k,$$ where $c \in (0, \frac{1}{2})$.

Note that $\nabla f_k^\top p_k$ is the directional derivative.

A downside of the Goldstein conditions is that they might exclude all minimizers along $p_k$.
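
As with the Wolfe conditions, checking the Goldstein conditions for a candidate $\alpha$ is straightforward; `satisfies_goldstein` below is a hypothetical sketch, not part of the included code.

using LinearAlgebra

# Check the Goldstein conditions for a candidate step length α.
function satisfies_goldstein(f, g, x, p, α; c=0.25)
    dϕ0 = dot(g(x), p)             # directional derivative along p
    fα  = f(x + α * p)
    return f(x) + (1 - c) * α * dϕ0 <= fα <= f(x) + c * α * dϕ0
end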

Backtracking algorithm

An algorithm that works pretty well for picking $\alpha$ is the Armijo backtracking algorithm.

Here we run this algorithm in every iteration to choose the specific $\alpha_k$ we want to use.

We start with some initial value $\alpha > 0$ (ideally a value that is, if anything, too long) and then gradually decay $\alpha$ until the Armijo sufficient decrease condition $$ f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^\top p_k$$ is satisfied.

More sophisticated inexact line searches, based on, e.g., quadratic interpolation, can lead to faster convergence.

We implement the basic backtracking algorithm for all line search algorithms described in this notebook.
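
A minimal sketch of such a backtracking routine is below; the version in ../optimizers/linesearch.jl may differ in its choice of initial step, decay factor, and constants.

using LinearAlgebra

# Armijo backtracking: shrink α by a factor ρ until sufficient decrease holds.
function backtracking(f, g, x, p; α0=1.0, ρ=0.5, c1=1e-4, maxiter=50)
    α   = α0
    fx  = f(x)
    dϕ0 = dot(g(x), p)                        # directional derivative, assumed < 0
    for _ in 1:maxiter
        if f(x + α * p) <= fx + c1 * α * dϕ0  # Armijo sufficient decrease
            return α
        end
        α *= ρ
    end
    return α                                  # give up after maxiter reductions
end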

Step directions

We now describe several different types of step directions $p_k$, which are what differentiates different named line search algorithms.

Steepest descent

Here we just step in the direction of the negative gradient (the steepest descent direction), $p_k = -\nabla f(x_k)$.

Newton's method

Here the step is $p_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k).$ This is the negative gradient multiplied by the inverse of the exact Hessian.
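
In code, this amounts to solving the linear system $\nabla^2 f_k \, p_k = -\nabla f_k$ rather than forming the inverse explicitly (a one-line sketch; this is presumably handled internally by the included line_search routine):

# Newton direction: solve ∇²f_k p = -∇f_k via a factorization of the Hessian.
newton_direction(H, g) = -(H \ g)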

Indefinite Hessian corrections

If the Hessian $\nabla^2 f(x_k)$ is not positive definite, then $p_k$ may not be a descent direction. If $f$ is not a convex function, this is likely to happen when we are not close enough to a local minimizer.

To account for this, we make a correction to the Hessian by adding a small multiple of the identity to the diagonal, increasing the multiple until the result is sufficiently positive definite.

Other ways to correct for a non-positive definite Hessian include modified Cholesky factorizations and modified indefinite factorization methods, but here we stick to the most basic practical approach.
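
A sketch of this diagonal-shift correction is below (the version used by the included code may pick the shift differently); it simply tries larger and larger multiples of the identity until a Cholesky factorization succeeds.

using LinearAlgebra

# Add τI to H, increasing τ until H + τI is positive definite (Cholesky succeeds).
function corrected_hessian(H; β=1e-3, maxtries=60)
    τ = minimum(diag(H)) > 0 ? 0.0 : -minimum(diag(H)) + β
    for _ in 1:maxtries
        F = cholesky(Symmetric(H + τ * I); check=false)
        issuccess(F) && return H + τ * I
        τ = max(2τ, β)                # grow the shift and try again
    end
    error("could not make the Hessian positive definite")
end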

Quasi-Newton methods

Here, instead of computing the exact Hessian at each step, we maintain a matrix $B_k$ that approximates the inverse Hessian and is updated at each iteration.

So, the descent direction is of the form $$p_k = -B_k \nabla f(x_k),$$ where $B_k$ is a positive definite matrix.

Secant condition

The updated approximation is required to satisfy the secant condition $$ B_{k+1} y_k = s_k,$$ where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$, so that $B_{k+1}$ mimics the curvature of $f$ observed along the most recent step.

BFGS

In the BFGS algorithm, the matrix $B_k$ is always positive definite, so we don't have to worry about indefinite corrections.

The Armijo backtracking algorithm might accept steps with negative curvature (i.e., $s_k^\top y_k \leq 0$), which would destroy positive definiteness of the update. We can account for this by using a damped BFGS update.
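
The standard BFGS update of the inverse-Hessian approximation is sketched below; here the update is simply skipped when the curvature is not sufficiently positive, whereas a damped BFGS update would modify $y_k$ instead. The BFGS code in ../optimizers/linesearch.jl may differ in detail.

using LinearAlgebra

# BFGS update of the inverse-Hessian approximation B (so that p = -B∇f).
function bfgs_update(B, s, y)
    sy = dot(s, y)
    if sy <= 1e-10 * norm(s) * norm(y)    # curvature not positive enough: skip
        return B
    end
    ρ = 1 / sy
    V = I - ρ * y * s'
    return V' * B * V + ρ * s * s'        # (I - ρ s yᵀ) B (I - ρ y sᵀ) + ρ s sᵀ
end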

Inexact Newton methods and conjugate gradient

In Newton's method, we find the search direction $p_k$ by solving the linear system $$\nabla^2 f_k p_k = -\nabla f_k.$$ Conjugate gradient (CG) methods can solve a large linear system inexpensively, and can be turned into a line search method. We implement the Newton-CG method (a.k.a. truncated Newton method), in which we compute the search direction by applying the CG method to the above Newton equation, attempting to satisfy a termination test of the form $$||r_k || \leq \eta_k ||\nabla f_k||,$$ where $r_k$ is the residual at CG step $k$ and $\eta_k \in (0,1)$.
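
A sketch of this inner CG loop is below. It truncates when the residual test is met, and falls back to the steepest descent direction if negative curvature is encountered on the first iteration; the Newton-CG routine in ../optimizers/linesearch.jl may handle these details differently.

using LinearAlgebra

# CG applied to ∇²f p = -∇f, truncated at the residual test ||r|| ≤ η||∇f||.
function newton_cg_direction(H, g; η=0.1, maxiter=length(g))
    p = zeros(length(g))
    r = copy(g)                            # residual of Hp = -g at p = 0
    d = -r
    for _ in 1:maxiter
        Hd = H * d
        if dot(d, Hd) <= 0                 # negative curvature along d
            return iszero(p) ? -g : p      # steepest descent if on first iteration
        end
        α = dot(r, r) / dot(d, Hd)
        p += α * d
        rnew = r + α * Hd
        norm(rnew) <= η * norm(g) && return p
        d = -rnew + (dot(rnew, rnew) / dot(r, r)) * d
        r = rnew
    end
    return p
end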

Convergence of line search algorithms

Global convergence of line search algorithms

Zoutendijk's theorem tells us when we get global convergence in line search algorithms.

In particular, it says that under the conditions

  1. $f$ is continuously differentiable in an open set $N$ containing $\{x : f(x) \leq f(x_0)\}$.
  2. the gradient of $f$ is Lipschitz continuous on $N$, i.e., for all $x,x' \in N, ||\nabla f(x) - \nabla f(x')|| \leq L \,||x - x'||$.
  3. $p_k$ is a descent direction, i.e., $\nabla f_k^\top p_k < 0$
  4. $\alpha_k$ satisfies the Wolfe conditions (or Goldstein conditions)
  5. $f$ is bounded below in $\mathbb{R}^n$ (so it doesn't go off to $-\infty$)

that we have $$ \sum_{k=0}^\infty ||\nabla f_k||^2 \cos^2(\theta_k) <\infty,$$ where $\theta_k$ is the angle between the descent direction $p_k$ and the negative gradient $-\nabla f_k$ (the steepest descent direction), defined via $$\cos \theta_k = \frac{ -\nabla f_k^\top p_k}{||\nabla f_k|| \, ||p_k||}.$$

This implies that $||\nabla f_k|| \rightarrow 0$ (global convergence) if the angle of the descent direction with the negative gradient (i.e., the direction of steepest descent) is bounded away from 90 degrees.

TL;DR: any descent direction $p_k$ that is not too far away from the steepest descent direction leads to a globally convergent iteration.
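
This alignment is easy to monitor along the iterates; a one-line sketch:

using LinearAlgebra

# cos θ_k: alignment of the search direction p with the steepest descent direction -g.
cos_angle(g, p) = -dot(g, p) / (norm(g) * norm(p))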

Rate of convergence

Though the steepest descent method is the "quintessential" globally convergent algorithm, in that the direction is the negative gradient (i.e., the angle $\theta_k = 0$), it has a very slow (linear) rate of convergence. In contrast, Newton's method and quasi-Newton methods have much faster rates of convergence.

We'll see this in the example functions below. We'll plot the gradient norm ratios, which tell us whether we have linear or superlinear convergence: if the ratios go to 0, then we have superlinear convergence. (To see quadratic convergence, we'd look at the gradient norm divided by the square of the previous iteration's gradient norm. We don't plot this, but we'll observe that Newton converges much faster than the other methods.)

Summary of convergence rates:

  1. Steepest descent (linear)
  2. Newton's method (quadratic)
  3. Quasi-Newton (usually superlinear)
  4. Conjugate gradient (superlinear)

Testing the optimization methods


In [12]:
# general function to plot the output of the algorithms
function plot_values(fx1, fx2, fx3, fx4, xlab, ylab, title)
    nsamps = length(fx1)
    nsamps2 = length(fx2)
    nsamps3 = length(fx3)
    nsamps4 = length(fx4)
    
    Gadfly.plot(layer(x=1:nsamps, y=fx1, Geom.line, 
            Theme(default_color=color("blue"))),
        layer(x=1:nsamps2, y=fx2, Geom.line, 
            Theme(default_color=color("red"))),
        layer(x=1:nsamps3, y=fx3, Geom.line, 
            Theme(default_color=color("orange"))),
        layer(x=1:nsamps4, y=fx4, Geom.line, 
            Theme(default_color=color("purple"))),
        Guide.xlabel(xlab), Guide.ylabel(ylab), 
        Guide.title(title),
        Guide.manual_color_key("Legend", 
            ["Newton", "steepest", "Newton-CG", "BFGS"], 
            ["blue", "red", "orange", "purple"]),
        Scale.x_log10, Scale.y_log10)
end


Out[12]:
plot_values (generic function with 1 method)

Fenton's function


In [5]:
@time xvals = line_search(fenton, [3.;4.], 
    fenton_g, fenton_h, "newton", 1000);
@time svals = line_search(fenton, [3.;4.], 
    fenton_g, fenton_h, "steepest", 1000);
@time cvals = line_search(fenton, [3.;4.], 
    fenton_g, fenton_h, "newton_CG", 1000);
@time qvals = line_search(fenton, [3.;4.], 
    fenton_g, fenton_h, "BFGS", 1000);


Using method newton
Number of indefinite fixes 0
Number of iterations 7
  2.345913 seconds (1.72 M allocations: 78.202 MB, 1.18% gc time)
Using method steepest
100
200
300
Number of indefinite fixes 0
Number of iterations 376
  0.034531 seconds (42.26 k allocations: 1.353 MB)
Using method newton_CG
Number of indefinite fixes 0
Number of iterations 10
  0.806269 seconds (1.02 M allocations: 45.587 MB, 1.53% gc time)
Using method BFGS
Number of indefinite fixes 0
Number of iterations 13
  0.116728 seconds (91.09 k allocations: 3.965 MB)

In [7]:
nsamps = length(xvals)
nsamps2 = length(svals)
nsamps3 = length(cvals)
nsamps4 = length(qvals)


fx = [fenton(xvals[i]) for i in 1:nsamps]
fx2 = [fenton(svals[i]) for i in 1:nsamps2]
fx3 = [fenton(cvals[i]) for i in 1:nsamps3]
fx4 = [fenton(qvals[i]) for i in 1:nsamps4]

plot_values(fx, fx2, fx3, fx4, "iteration", "f(x)", "Value of function")


Out[7]:
[Gadfly plot: f(x) vs. iteration (log-log scales) for Newton, steepest descent, Newton-CG, and BFGS on Fenton's function.]

In [9]:
grads = [norm(fenton_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(fenton_g(svals[i]), 2) for i in 1:nsamps2]
grads3 = [norm(fenton_g(cvals[i]), 2) for i in 1:nsamps3]
grads4 = [norm(fenton_g(qvals[i]), 2) for i in 1:nsamps4]

r1 = compute_grad_ratio(grads)
r2 = compute_grad_ratio(grads2)
r3 = compute_grad_ratio(grads3)
r4 = compute_grad_ratio(grads4)

plot_values(r1, r2, r3, r4, "iteration", 
    "gradient norm ratios", "gradient norm ratios")


Out[9]:
[Gadfly plot: gradient norm ratios vs. iteration (log-log scales) for the four methods on Fenton's function.]

In [10]:
plot_values(grads, grads2, grads3, grads4, "iteration", 
    "gradient norms", "gradient norms")


Out[10]:
[Gadfly plot: gradient norms vs. iteration (log-log scales) for the four methods on Fenton's function.]

Rosenbrock function

The Rosenbrock function is defined by: $$f(x) = \sum_{i=1}^{n/2} \left[ (1-x_{2i-1})^2 + 10(x_{2i} - x_{2i-1}^2)^2 \right]$$
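
A direct transcription of this formula is below (the `rosenbrock`, `rosenbrock_g`, and `rosenbrock_h` used in the next cell are defined in ../utils/functions.jl and may use different constants, e.g. the classical factor of 100 instead of 10):

# Rosenbrock function as written above, for even-length x.
rosenbrock_sketch(x) = sum((1 - x[2i-1])^2 + 10 * (x[2i] - x[2i-1]^2)^2
                           for i in 1:div(length(x), 2))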


In [13]:
@time xvals = line_search(rosenbrock, ones(100)*3, rosenbrock_g, rosenbrock_h, 
    "newton", 2000, 1e-8);
@time svals = line_search(rosenbrock, ones(100)*3, rosenbrock_g, rosenbrock_h, 
    "steepest", 5000, 1e-8);
@time cvals = line_search(rosenbrock, ones(100)*3, rosenbrock_g, rosenbrock_h, 
    "newton_CG", 2000, 1e-8);
@time qvals = line_search(rosenbrock, ones(100)*3, rosenbrock_g, rosenbrock_h, 
    "BFGS", 2000, 1e-8);


Using method newton
Number of indefinite fixes 0
Number of iterations 12
  3.791919 seconds (49.75 M allocations: 767.828 MB, 5.08% gc time)
Using method steepest
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
Number of indefinite fixes 0
Number of iterations 3064
 11.670348 seconds (144.08 M allocations: 2.444 GB, 2.82% gc time)
Using method newton_CG
Number of indefinite fixes 0
Number of iterations 15
  4.140566 seconds (62.13 M allocations: 957.349 MB, 2.32% gc time)
Using method BFGS
Number of indefinite fixes 0
Number of iterations 43
  0.470065 seconds (6.73 M allocations: 139.479 MB, 3.37% gc time)

In [15]:
func = rosenbrock
func_g = rosenbrock_g

nsamps = length(xvals)
nsamps2 = length(svals)
nsamps3 = length(cvals)
nsamps4 = length(qvals)


fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(svals[i]) for i in 1:nsamps2]
fx3 = [func(cvals[i]) for i in 1:nsamps3]
fx4 = [func(qvals[i]) for i in 1:nsamps4]

plot_values(fx, fx2, fx3, fx4, "iteration", "f(x)", "Value of function")


Out[15]:
[Gadfly plot: f(x) vs. iteration (log-log scales) for the four methods on the Rosenbrock function.]

In [17]:
nsamps = length(xvals)

grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(svals[i]), 2) for i in 1:nsamps2]
grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
grads4 = [norm(func_g(qvals[i]), 2) for i in 1:nsamps4]

r1 = compute_grad_ratio(grads)
r2 = compute_grad_ratio(grads2)
r3 = compute_grad_ratio(grads3)
r4 = compute_grad_ratio(grads4)

plot_values(r1, r2, r3, r4, "iteration", "gradient norm ratios", 
    "gradient norm ratios")


Out[17]:
[Gadfly plot: gradient norm ratios vs. iteration (log-log scales) for the four methods on the Rosenbrock function.]

In [19]:
plot_values(grads, grads2, grads3, grads4, "iteration", "gradient norms", 
    "gradient norms")


Out[19]:
[Gadfly plot: gradient norms vs. iteration (log-log scales) for the four methods on the Rosenbrock function.]

Cute function

The cute function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is given by $$ f(x) = \sum_{i=1}^{n-4} \left[ (-4x_i+3)^2 + (x_i^2 + 2x_{i+1}^2 + 3x_{i+2}^2 + 4 x_{i+3}^2 + 5x_n^2)^2 \right] $$
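
A direct transcription of this formula is below (the `cute`, `cute_g`, and `cute_h` used in the next cell are defined in ../utils/functions.jl and may differ slightly):

# Cute function as written above, for n > 4.
cute_sketch(x) = sum((-4x[i] + 3)^2 +
                     (x[i]^2 + 2x[i+1]^2 + 3x[i+2]^2 + 4x[i+3]^2 + 5x[end]^2)^2
                     for i in 1:length(x)-4)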


In [20]:
@time xvals = line_search(cute, ones(100)*10, cute_g, cute_h, "newton", 
    2000, 1e-8);
@time svals = line_search(cute, ones(100)*10, cute_g, cute_h, "steepest", 
    2000, 1e-8);
@time cvals = line_search(cute, ones(100)*10, cute_g, cute_h, "newton_CG", 
    2000, 1e-8);
@time qvals = line_search(cute, ones(100)*10, cute_g, cute_h, "BFGS", 
    2000, 1e-8);


Using method newton
Number of indefinite fixes 0
Number of iterations 16
 16.224665 seconds (126.05 M allocations: 1.889 GB, 3.86% gc time)
Using method steepest
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
Finished algorithm without converging.
 26.568222 seconds (187.94 M allocations: 3.078 GB, 5.12% gc time)
Using method newton_CG
Number of indefinite fixes 0
Number of iterations 21
 20.689629 seconds (165.43 M allocations: 2.483 GB, 3.66% gc time)
Using method BFGS
100
Number of indefinite fixes 0

In [21]:
func = cute
func_g = cute_g

nsamps = length(xvals)
nsamps2 = length(svals)
nsamps3 = length(cvals)
nsamps4 = length(qvals)

fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(svals[i]) for i in 1:nsamps2]
fx3 = [func(cvals[i]) for i in 1:nsamps3]
fx4 = [func(qvals[i]) for i in 1:nsamps4]


plot_values(fx, fx2, fx3, fx4, "iteration", "f(x)", "Value of function")


Out[21]:
[Gadfly plot: f(x) vs. iteration (log-log scales) for the four methods on the cute function.]
Number of iterations 131
  2.744460 seconds (23.09 M allocations: 464.262 MB, 3.43% gc time)

In [22]:
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(svals[i]), 2) for i in 1:nsamps2]
grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
grads4 = [norm(func_g(qvals[i]), 2) for i in 1:nsamps4]

r1 = compute_grad_ratio(grads)
r2 = compute_grad_ratio(grads2)
r3 = compute_grad_ratio(grads3)
r4 = compute_grad_ratio(grads4)

plot_values(r1, r2, r3, r4, "iteration", "gradient norm ratio", 
    "gradient norm ratios")


Out[22]:
[Gadfly plot: gradient norm ratios vs. iteration (log-log scales) for the four methods on the cute function.]

In [23]:
plot_values(grads, grads2, grads3, grads4, "iteration", "gradient norm", 
    "gradient norms")


Out[23]:
[Gadfly plot: gradient norms vs. iteration (log-log scales) for the four methods on the cute function.]
