Code for the algorithms and the test functions is available in the following files:
In [5]:
include("../optimizers/trustregion.jl")
include("../utils/functions.jl")
using Gadfly
Let $f_k := f(x_k)$ denote the value of the function at the iterate $x_k$, and let $g_k := \nabla f(x_k)$ denote the gradient at $x_k$.
Recall the quadratic model: $$m_k(p) = f_k + g_k^\top p + \frac{1}{2} p^\top B_k p,$$ which comes from the second-order Taylor expansion of $f$ around $x_k$.
Here $B_k$ is some symmetric matrix. When $B_k$ is the exact Hessian, we get the trust region Newton method.
We first solve the trust region subproblem: $$\min_p m_k(p) \text{ s.t. } ||p|| \leq \Delta_k,$$ where $\Delta_k$ is the trust region radius. In practice, we solve the subproblem approximately -- see the next section for more details.
We define the following ratio at each iteration $k$: $$ \rho_k = \frac{f(x_k) - f(x_k + p_k) }{ m_k(0) - m_k(p_k) },$$ where the numerator is the actual reduction (the difference between the function value at the current iterate and at the trial point), and the denominator is the predicted reduction, i.e., the reduction in $f$ predicted by the model function.
The predicted reduction $m_k(0) - m_k(p_k)$ is always nonnegative, so the sign and size of the actual reduction tell us how to adjust the trust region radius. If the actual reduction $$f(x_k) - f(x_k + p_k) < 0,$$ then the new objective value $f(x_k + p_k)$ is greater than the current value $f(x_k)$, so this step should be rejected.
Otherwise, we look at the ratio $\rho_k$: if it is close to zero or negative, the model does not match the function well in that region, so we shrink the trust region radius; if it is close to 1, the model agrees well with the function, and we can expand the radius (provided the step reached the boundary); for intermediate values we keep the radius the same. That gives rise to the trust region algorithm.
To summarize, each iteration of the algorithm does the following (see the trust region implementation code for further details):
1. Solve the trust region subproblem (approximately) for the step $p_k$.
2. Compute the ratio $\rho_k$.
3. Adjust the trust region radius: shrink it if $\rho_k$ is small, expand it (up to $\hat{\Delta}$) if $\rho_k$ is close to 1 and $||p_k||$ reaches the boundary, and otherwise leave it unchanged.
4. Update the iterate: set $x_{k+1} = x_k + p_k$ if $\rho_k > \eta$, and $x_{k+1} = x_k$ otherwise.
Here $\eta$ is a parameter in $[0, \frac{1}{4})$, and $\hat{\Delta} > 0$ is an upper bound on how big the trust region radius can get. We choose some initial trust region radius $\Delta_0 \in (0,\hat{\Delta})$. Note that the radius is only increased if $||p_k||$ reaches the boundary of the trust region.
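To make the sketch concrete, here is a minimal Julia version of the outer loop. It is only a sketch under standard parameter choices (the $\frac{1}{4}$ and $\frac{3}{4}$ thresholds), not the implementation in trustregion.jl; the helper subproblem(g, B, Δ) is a placeholder for any approximate subproblem solver, such as the Cauchy point or dogleg step discussed below.
using LinearAlgebra

# Sketch of the trust region outer loop (not the implementation in trustregion.jl).
# `subproblem(g, B, Δ)` returns an approximate minimizer of m_k within the trust region.
function trust_region_sketch(f, ∇f, ∇²f, x0, subproblem;
                             Δmax=1.0, Δ0=Δmax/2, η=0.1, maxiter=1000, tol=1e-8)
    x, Δ = copy(x0), Δ0
    for k in 1:maxiter
        g, B = ∇f(x), ∇²f(x)
        norm(g) ≤ tol && break
        p = subproblem(g, B, Δ)

        # ρ = actual reduction / predicted reduction, with m(0) - m(p) = -(gᵀp + ½ pᵀBp).
        pred = -(dot(g, p) + 0.5 * dot(p, B * p))
        ρ = (f(x) - f(x + p)) / pred

        # Adjust the trust region radius.
        if ρ < 0.25
            Δ /= 4                              # poor model agreement: shrink
        elseif ρ > 0.75 && norm(p) ≥ 0.99Δ
            Δ = min(2Δ, Δmax)                   # good agreement at the boundary: expand
        end

        # Update the iterate only if the reduction is acceptable.
        ρ > η && (x = x + p)
    end
    return x
end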
Although we'd ideally solve the subproblem for the step $p_k$ exactly, for global convergence it is enough to solve it approximately, as long as the approximate step achieves a sufficient reduction in the model.
One such approximation replaces the quadratic model $m_k(p)$, which has a second-order term, with the easier linear model $$l_k(p) := f_k + g_k^\top p.$$
We can quantify this in terms of the Cauchy point, denoted as $$p_k^c := \tau_k p_k^s.$$
Here the Cauchy point is the product of two terms: $p_k^s$, the minimizer of the linear model $l_k$ subject to the trust region constraint $||p|| \leq \Delta_k$, and $\tau_k \geq 0$, the scalar that minimizes the quadratic model $m_k(\tau p_k^s)$ along that direction while keeping the step inside the trust region.
The closed-form expression of the Cauchy point is given by $$p_k^c = -\tau_k \frac{\Delta_k}{||g_k||} g_k,$$ where $$\tau_k = \begin{cases} 1 & \text{ if } g_k^\top B_k g_k \leq 0 \\ \min(\frac{||g_k||^3}{(\Delta_k g_k^\top B_k g_k)}, 1) &\text{ otherwise } \end{cases} $$
So, in the algorithm, whenever we accept the Cauchy point, we'd update our iterate as follows $$ x_{k+1} = x_k + p_k^c.$$
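As a concrete illustration of the closed-form expression, here is a small Julia sketch; the name cauchy_point is hypothetical, and this is not the routine used by trust_region above.
using LinearAlgebra

# Cauchy point: minimizer of the quadratic model along -g, restricted to the trust region.
function cauchy_point(g, B, Δ)
    gBg   = dot(g, B * g)
    gnorm = norm(g)
    τ = gBg ≤ 0 ? 1.0 : min(gnorm^3 / (Δ * gBg), 1.0)
    return -τ * (Δ / gnorm) * g
end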
While the Cauchy step is inexpensive to compute and leads to a globally convergent trust-region method, it performs poorly even if an optimal step length is used at each iteration, as it is essentially the steepest descent method with a particular step length.
The Cauchy point doesn't rely heavily on the matrix $B_k$ (it's only used in computing the length of the step) -- an improvement on the Cauchy point will make use of $B_k$.
Many trust region approximation algorithms are designed to improve upon the Cauchy point, and the dogleg method is one such method. If the Cauchy point lies on the boundary of the trust region, it already gives a large decrease, so we accept that step. Otherwise, the Cauchy point lies in the interior, and we would instead like to take a "Newton" step $p^B = -B^{-1} g$, where the matrix $B$ only needs to be nonsingular (instead of positive definite).
The dogleg method combines two steps: \begin{align*} p^B &= -B^{-1} g \\ p^U &= -\frac{g^\top g}{g^\top B g} g \end{align*}
Note that $p^U$ is the unconstrained minimizer of $m_k$ along the steepest descent direction $-g_k$.
The dogleg path has the following trajectory: \begin{align*} \tilde{p}(\tau) = \begin{cases} \tau p^U & 0 \leq \tau \leq 1 \\ p^U + (\tau-1)(p^B - p^U) &1\leq\tau\leq 2 \end{cases} \end{align*}
We can compute $\tau$ by solving the following scalar quadratic equation: $$||p^U + (\tau-1) (p^B - p^U) ||^2 = \Delta^2,$$ which we can solve using the quadratic formula.
A quick summary of the dogleg step: if the full Newton step $p^B$ lies inside the trust region, take it; if even $p^U$ lies outside, step to the boundary along the steepest descent direction; otherwise take the point where the dogleg path crosses the boundary.
TL;DR: we obtain at least as much decrease as the Cauchy point, while still allowing a full Newton step once we're close to the solution.
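For concreteness, here is a Julia sketch of the dogleg step under the usual assumption that $B$ is positive definite; dogleg_step is a hypothetical helper, not the "dogleg" option passed to trust_region below.
using LinearAlgebra

# Dogleg step: follow the path 0 → pU → pB and return the point where it leaves
# the trust region (or pB itself if the full step already lies inside).
function dogleg_step(g, B, Δ)
    pB = -(B \ g)                           # full "Newton" step
    norm(pB) ≤ Δ && return pB

    pU = -(dot(g, g) / dot(g, B * g)) * g   # unconstrained minimizer of m along -g
    if norm(pU) ≥ Δ
        # Even the steepest-descent leg leaves the region: step to the boundary along -g.
        return (Δ / norm(pU)) * pU
    end

    # Find τ ∈ [1, 2] with ||pU + (τ - 1)(pB - pU)|| = Δ via the quadratic formula,
    # writing s = τ - 1 and solving a s² + 2 b s + c = 0.
    d = pB - pU
    a = dot(d, d)
    b = dot(pU, d)
    c = dot(pU, pU) - Δ^2
    s = (-b + sqrt(b^2 - a * c)) / a        # the root lying in (0, 1)
    return pU + s * d
end
Either sketch can be plugged into the outer loop sketched earlier, e.g. trust_region_sketch(rosenbrock, rosenbrock_g, rosenbrock_h, randn(100), dogleg_step).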
In [13]:
xvals = trust_region([3.;4.], 6, 1, 0.1, fenton, fenton_g, fenton_h,
2000, "dogleg");
cvals = trust_region([3.;4.], 6, 1, 0.1, fenton, fenton_g, fenton_h,
2000, "cg_steihaug");
println(xvals[end])
In [14]:
nsamps = length(xvals)
nsamps2 = length(cvals)
fx = [fenton(xvals[i]) for i in 1:nsamps]
fx2 = [fenton(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[14]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the Fenton function]
In [17]:
nsamps = length(xvals)
grads = [norm(fenton_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(fenton_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(fenton_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-2, y=grads[2:nsamps-1,:]./grads[1:nsamps-2,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-2, y=grads2[2:nsamps2-1,:]./grads2[1:nsamps2-2,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[17]: [plot: gradient norm ratios vs. iteration (log-log)]
In [16]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[16]: [plot: gradient norms vs. iteration (log-log)]
In [18]:
xvals = trust_region(randn(100), 6, 3, 0.1, rosenbrock,
rosenbrock_g, rosenbrock_h, 1000, "dogleg");
cvals = trust_region(randn(100), 6, 3, 0.1, rosenbrock,
rosenbrock_g, rosenbrock_h, 1000, "cg_steihaug");
In [19]:
nsamps = length(xvals)
nsamps2 = length(cvals)
func = rosenbrock
func_g = rosenbrock_g
fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"],
["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[19]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the Rosenbrock function]
In [26]:
nsamps = length(xvals)
nsamps2 = length(cvals)
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-2, y=grads[2:nsamps-1,:]./grads[1:nsamps-2,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-2, y=grads2[2:nsamps2-1,:]./grads2[1:nsamps2-2,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[26]: [plot: gradient norm ratios vs. iteration (log-log)]
In [21]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[21]: [plot: gradient norms vs. iteration (log-log)]
In [22]:
xvals = trust_region(ones(50)*10, 6, 3, 0.1, cute, cute_g, cute_h, 2000, "dogleg");
cvals = trust_region(ones(50)*10, 6, 3, 0.1, cute, cute_g, cute_h, 2000, "cg_steihaug");
In [23]:
nsamps = length(xvals)
nsamps2 = length(cvals)
func = cute
func_g = cute_g
fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[23]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the cute test function]
In [24]:
nsamps = length(xvals)
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-1, y=grads[2:nsamps,:]./grads[1:nsamps-1,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-1, y=grads2[2:nsamps2,:]./grads2[1:nsamps2-1,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[24]: [plot: gradient norm ratios vs. iteration (log-log)]
In [25]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[25]: [plot: gradient norms vs. iteration (log-log)]
In [27]:
xvals = trust_region(ones(50)*10, 6, 3, 0.1, cute2, cute2_g, cute2_h, 2000, "dogleg");
cvals = trust_region(ones(50)*10, 6, 3, 0.1, cute2, cute2_g, cute2_h, 2000, "cg_steihaug");
In [28]:
nsamps = length(xvals)
nsamps2 = length(cvals)
func = cute2
func_g = cute2_g
fx = [func(xvals[i]) for i in 1:nsamps]
fx2 = [func(cvals[i]) for i in 1:nsamps2]
Gadfly.plot(layer(x=1:nsamps, y=fx, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=fx2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=fx3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("f(x)"), Guide.title("Value of function"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[28]: [plot: function value vs. iteration (log-log), Newton dogleg vs. Steihaug-CG on the cute2 test function]
In [29]:
nsamps = length(xvals)
grads = [norm(func_g(xvals[i]), 2) for i in 1:nsamps]
grads2 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps2]
#grads3 = [norm(func_g(cvals[i]), 2) for i in 1:nsamps3]
Gadfly.plot(
layer(x=1:nsamps-1, y=grads[2:nsamps,:]./grads[1:nsamps-1,:], Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2-1, y=grads2[2:nsamps2,:]./grads2[1:nsamps2-1,:], Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3-1, y=grads3[2:nsamps3,:]./grads3[1:nsamps3-1,:], Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm ratios"), Guide.title("gradient norm ratios"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[29]: [plot: gradient norm ratios vs. iteration (log-log)]
In [30]:
Gadfly.plot(layer(x=1:nsamps, y=grads, Geom.line, Theme(default_color=color("blue"))),
layer(x=1:nsamps2, y=grads2, Geom.line, Theme(default_color=color("red"))),
#layer(x=1:nsamps3, y=grads3, Geom.line, Theme(default_color=color("orange"))),
Guide.xlabel("iteration"), Guide.ylabel("gradient norm"), Guide.title("gradient norms"),
Guide.manual_color_key("Legend", ["Newton dogleg", "Steihaug-CG", "quasi-Newton SR1"], ["blue", "red", "orange"]),
Scale.x_log10, Scale.y_log10)
Out[30]: [plot: gradient norms vs. iteration (log-log)]