Code for the algorithms and the test functions is available in the following files:
In [5]:
include("../optimizers/trustregion.jl")
include("../utils/functions.jl")
using Gadfly
Let $f_k := f(x_k)$ denote the value of the function at the iterate $x_k$, and let $g_k := \nabla f(x_k)$ denote the gradient at $x_k$.
Recall the quadratic model: $$m_k(p) = f_k + g_k^\top p + \frac{1}{2} p^\top B_k p,$$ which comes from the second-order Taylor expansion of $f$ around $x_k$.
Here $B_k$ is some symmetric matrix. When $B_k$ is the exact Hessian, we get the trust region Newton method.
We first solve the trust region subproblem: $$\min_p m_k(p) \text{ s.t. } ||p|| \leq \Delta_k,$$ where $\Delta_k$ is the trust region radius. In practice, we solve the subproblem approximately -- see the next section for more details.
We define the following ratio at each iteration $k$: $$ \rho_k = \frac{f(x_k) - f(x_k + p_k) }{ m_k(0) - m_k(p_k) },$$ where the numerator is the actual reduction (the difference between the function value at the current iterate and at the trial point), and the denominator is the predicted reduction, i.e., the reduction in $f$ predicted by the model function.
The predicted reduction $m_k(0) - m_k(p_k)$ is always nonnegative, so the sign and size of the actual reduction tell us how to adjust the trust region radius. If the actual reduction $$f(x_k) - f(x_k + p_k) < 0,$$ then the new objective value $f(x_k + p_k)$ is greater than the current value $f(x_k)$, so this step should be rejected.
Otherwise, we look at the ratio $\rho_k$: if it is close to zero or negative, the model does not match the function well in that region, so we shrink the trust region radius; if it is close to 1, the model agrees well with the function, and we can expand the radius (provided the step reached the boundary); for intermediate values we keep the radius the same. That gives rise to the trust region algorithm.
To summarize, each iteration of the algorithm does the following (see the trust region implementation code for further details):
1. Solve the trust region subproblem (approximately) for the step $p_k$.
2. Compute the ratio $\rho_k$.
3. Adjust the trust region radius: shrink it if $\rho_k$ is small, expand it (up to $\hat{\Delta}$) if $\rho_k$ is close to 1 and $||p_k||$ reaches the boundary, and otherwise leave it unchanged.
4. Update the iterate: set $x_{k+1} = x_k + p_k$ if $\rho_k > \eta$, and $x_{k+1} = x_k$ otherwise.
Here $\eta$ is a parameter in $[0, \frac{1}{4})$, and $\hat{\Delta} > 0$ is an upper bound on how big the trust region radius can get. We choose some initial trust region radius $\Delta_0 \in (0,\hat{\Delta})$. Note that the radius is only increased if $||p_k||$ reaches the boundary of the trust region.
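To make the sketch concrete, here is a minimal Julia version of the outer loop. It is only a sketch under standard parameter choices (the $\frac{1}{4}$ and $\frac{3}{4}$ thresholds), not the implementation in trustregion.jl; the helper subproblem(g, B, Δ) is a placeholder for any approximate subproblem solver, such as the Cauchy point or dogleg step discussed below.
using LinearAlgebra

# Sketch of the trust region outer loop (not the implementation in trustregion.jl).
# `subproblem(g, B, Δ)` returns an approximate minimizer of m_k within the trust region.
function trust_region_sketch(f, ∇f, ∇²f, x0, subproblem;
                             Δmax=1.0, Δ0=Δmax/2, η=0.1, maxiter=1000, tol=1e-8)
    x, Δ = copy(x0), Δ0
    for k in 1:maxiter
        g, B = ∇f(x), ∇²f(x)
        norm(g) ≤ tol && break
        p = subproblem(g, B, Δ)

        # ρ = actual reduction / predicted reduction, with m(0) - m(p) = -(gᵀp + ½ pᵀBp).
        pred = -(dot(g, p) + 0.5 * dot(p, B * p))
        ρ = (f(x) - f(x + p)) / pred

        # Adjust the trust region radius.
        if ρ < 0.25
            Δ /= 4                              # poor model agreement: shrink
        elseif ρ > 0.75 && norm(p) ≥ 0.99Δ
            Δ = min(2Δ, Δmax)                   # good agreement at the boundary: expand
        end

        # Update the iterate only if the reduction is acceptable.
        ρ > η && (x = x + p)
    end
    return x
end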
Although we'd ideally solve the subproblem for the step $p_k$ exactly, for global convergence it is enough to solve it approximately, as long as the approximate step achieves a sufficient reduction in the model.
One such approximation replaces the quadratic model $m_k(p)$, which has a second-order term, with the easier linear model $$l_k(p) := f_k + g_k^\top p.$$
We can quantify this in terms of the Cauchy point, denoted as $$p_k^c := \tau_k p_k^s.$$
Here the Cauchy point is the product of two terms: $p_k^s$, the minimizer of the linear model $l_k$ subject to the trust region constraint $||p|| \leq \Delta_k$, and $\tau_k \geq 0$, the scalar that minimizes the quadratic model $m_k(\tau p_k^s)$ along that direction while keeping the step inside the trust region.
The closed-form expression of the Cauchy point is given by $$p_k^c = -\tau_k \frac{\Delta_k}{||g_k||} g_k,$$ where $$\tau_k = \begin{cases} 1 & \text{ if } g_k^\top B_k g_k \leq 0 \\ \min(\frac{||g_k||^3}{(\Delta_k g_k^\top B_k g_k)}, 1) &\text{ otherwise } \end{cases} $$
So, in the algorithm, whenever we accept the Cauchy point, we'd update our iterate as follows $$ x_{k+1} = x_k + p_k^c.$$
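As a concrete illustration of the closed-form expression, here is a small Julia sketch; the name cauchy_point is hypothetical, and this is not the routine used by trust_region above.
using LinearAlgebra

# Cauchy point: minimizer of the quadratic model along -g, restricted to the trust region.
function cauchy_point(g, B, Δ)
    gBg   = dot(g, B * g)
    gnorm = norm(g)
    τ = gBg ≤ 0 ? 1.0 : min(gnorm^3 / (Δ * gBg), 1.0)
    return -τ * (Δ / gnorm) * g
end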
While the Cauchy step is inexpensive to compute and leads to a globally convergent trust-region method, it performs poorly even if an optimal step length is used at each iteration, as it is essentially the steepest descent method with a particular step length.
The Cauchy point doesn't rely heavily on the matrix $B_k$ (it's only used in computing the length of the step) -- an improvement on the Cauchy point will make use of $B_k$.
Many trust region approximation algorithms are designed to improve upon the Cauchy point, and the dogleg method is one such method. If the Cauchy point lies on the boundary of the trust region, it already gives a large decrease, so we accept that step. Otherwise, the Cauchy point lies in the interior, and we would instead like to take a "Newton" step $p^B = -B^{-1} g$, where the matrix $B$ only needs to be nonsingular (instead of positive definite).
The dogleg method combines two steps: \begin{align*} p^B &= -B^{-1} g \\ p^U &= -\frac{g^\top g}{g^\top B g} g \end{align*}
Note that $p^U$ is the unconstrained minimizer of $m_k$ along the steepest descent direction $-g_k$.
The dogleg path has the following trajectory: \begin{align*} \tilde{p}(\tau) = \begin{cases} \tau p^U & 0 \leq \tau \leq 1 \\ p^U + (\tau-1)(p^B - p^U) &1\leq\tau\leq 2 \end{cases} \end{align*}
We can compute $\tau$ by solving the following scalar quadratic equation: $$||p^U + (\tau-1) (p^B - p^U) ||^2 = \Delta^2,$$ which we can solve using the quadratic formula.
A quick summary of the dogleg step: if the full Newton step $p^B$ lies inside the trust region, take it; if even $p^U$ lies outside, step to the boundary along the steepest descent direction; otherwise take the point where the dogleg path crosses the boundary.
TL;DR: we obtain at least as much decrease as the Cauchy point, while still allowing a full Newton step once we're close to the solution.
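For concreteness, here is a Julia sketch of the dogleg step under the usual assumption that $B$ is positive definite; dogleg_step is a hypothetical helper, not the "dogleg" option passed to trust_region below.
using LinearAlgebra

# Dogleg step: follow the path 0 → pU → pB and return the point where it leaves
# the trust region (or pB itself if the full step already lies inside).
function dogleg_step(g, B, Δ)
    pB = -(B \ g)                           # full "Newton" step
    norm(pB) ≤ Δ && return pB

    pU = -(dot(g, g) / dot(g, B * g)) * g   # unconstrained minimizer of m along -g
    if norm(pU) ≥ Δ
        # Even the steepest-descent leg leaves the region: step to the boundary along -g.
        return (Δ / norm(pU)) * pU
    end

    # Find τ ∈ [1, 2] with ||pU + (τ - 1)(pB - pU)|| = Δ via the quadratic formula,
    # writing s = τ - 1 and solving a s² + 2 b s + c = 0.
    d = pB - pU
    a = dot(d, d)
    b = dot(pU, d)
    c = dot(pU, pU) - Δ^2
    s = (-b + sqrt(b^2 - a * c)) / a        # the root lying in (0, 1)
    return pU + s * d
end
Either sketch can be plugged into the outer loop sketched earlier, e.g. trust_region_sketch(rosenbrock, rosenbrock_g, rosenbrock_h, randn(100), dogleg_step).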
In [13]:
xvals = trust_region([3.;4.], 6, 1, 0.1, fenton, fenton_g, fenton_h,
2000, "dogleg");
cvals = trust_region([3.;4.], 6, 1, 0.1, fenton, fenton_g, fenton_h,
2000, "cg_steihaug");
println(xvals[end])
In [14]:
nsamps = length(xvals)
nsamps2 = length(cvals)
fx = [fenton(xvals[i]) for i in 1:nsamps]
fx2 = [fenton(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[14]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the Fenton function]
In [17]:
nsamps = length(xvals)
grads = [norm(fenton_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(fenton_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(fenton_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-2, y=grads[2:nsamps-1,:]./grads[1:nsamps-2,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-2, y=grads2[2:nsamps2-1,:]./grads2[1:nsamps2-2,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[17]: [plot: gradient norm ratios vs. iteration (log-log)]
In [16]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[16]: [plot: gradient norms vs. iteration (log-log)]
In [18]:
xvals = trust_region(randn(100), 6, 3, 0.1, rosenbrock,
rosenbrock_g, rosenbrock_h, 1000, "dogleg");
cvals = trust_region(randn(100), 6, 3, 0.1, rosenbrock,
rosenbrock_g, rosenbrock_h, 1000, "cg_steihaug");
In [19]:
nsamps = length(xvals)
nsamps2 = length(cvals)
func = rosenbrock
func_g = rosenbrock_g
fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"],
["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[19]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the Rosenbrock function]
In [26]:
nsamps = length(xvals)
nsamps2 = length(cvals)
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-2, y=grads[2:nsamps-1,:]./grads[1:nsamps-2,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-2, y=grads2[2:nsamps2-1,:]./grads2[1:nsamps2-2,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[26]: [plot: gradient norm ratios vs. iteration (log-log)]
In [21]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[21]: [plot: gradient norms vs. iteration (log-log)]
In [22]:
xvals = trust_region(ones(50)*10, 6, 3, 0.1, cute, cute_g, cute_h, 2000, "dogleg");
cvals = trust_region(ones(50)*10, 6, 3, 0.1, cute, cute_g, cute_h, 2000, "cg_steihaug");
In [23]:
nsamps = length(xvals)
nsamps2 = length(cvals)
func = cute
func_g = cute_g
fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[23]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the cute test function]
In [24]:
nsamps = length(xvals)
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-1, y=grads[2:nsamps,:]./grads[1:nsamps-1,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-1, y=grads2[2:nsamps2,:]./grads2[1:nsamps2-1,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[24]: [plot: gradient norm ratios vs. iteration (log-log)]
In [25]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[25]: [plot: gradient norms vs. iteration (log-log)]
In [27]:
xvals = trust_region(ones(50)*10, 6, 3, 0.1, cute2, cute2_g, cute2_h, 2000, "dogleg");
cvals = trust_region(ones(50)*10, 6, 3, 0.1, cute2, cute2_g, cute2_h, 2000, "cg_steihaug");
In [28]:
nsamps = length(xvals)
nsamps2 = length(cvals)
func = cute2
func_g = cute2_g
fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[28]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the cute2 test function]
In [29]:
nsamps = length(xvals)
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-1, y=grads[2:nsamps,:]./grads[1:nsamps-1,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-1, y=grads2[2:nsamps2,:]./grads2[1:nsamps2-1,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[29]: [plot: gradient norm ratios vs. iteration (log-log)]
In [30]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[30]: [plot: gradient norms vs. iteration (log-log)]