Method | Convergence speed | Complexity | Affine invariance | Restrictions on $f(x)$ |
---|---|---|---|---|
GD | Global linear | $O(n)$ + step size search | No | Differentiable, Lipschitz gradient |
Newton method | Local quadratic | $O(n^3)$ + step size search | Yes | Twice differentiable; Lipschitz and positive definite Hessian |
Computational complexity can be reduced with quasi-Newton methods.
Computational and memory complexity can be reduced with limited-memory quasi-Newton methods (L-BFGS).
The gradient descent step can be viewed as the minimizer of the model
$$ f_G(x) = f(y) + \langle f'(y), x - y \rangle + \frac{1}{2\alpha}\|x - y\|^2_2, $$where $\alpha \in (0, 1/L]$ and $f(x) \leq f_G(x)$, which means $f_G$ is a global upper estimate of $f(x)$.
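Indeed, minimizing this upper bound over $x$ recovers the gradient descent step:
$$ \nabla_x f_G(x) = f'(y) + \frac{1}{\alpha}(x - y) = 0 \quad \Longrightarrow \quad x = y - \alpha f'(y). $$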
Idea: use an intermediate approximation of the form
$$ f_q(x) \approx f(y) + \langle f'(y), x - y \rangle + \frac{1}{2} (x-y)^{\top}{\color{red}{B(y)}}(x-y), $$which gives the following rule for updating the current point:
$$ x_{k+1} = x_k - \alpha_k B^{-1}_k f'(x_k) = x_k - \alpha_k H_k f'(x_k), $$where $H_k = B_k^{-1}$ and $h_k = -H_k f'(x_k)$ is the search direction. Requiring the gradient of the model built at $x_{k+1}$ to match $f'$ at both the previous and the current point, the first condition gives
$$ f'(x_{k+1}) - \alpha_k B_{k+1}h_k = f'(x_k), $$and the second one holds by construction. Equivalently, this is the secant equation
$$ B_{k+1} s_k = y_k, \qquad s_k = x_{k+1} - x_k = \alpha_k h_k, \quad y_k = f'(x_{k+1}) - f'(x_k). $$
Q: does the relation between vectors $s_k$ and $y_k$ always hold?
Hint: remember the Wolfe conditions
Q: is matrix $B_{k+1}$ unique?
Update equation for DFP:
$$ B_{k+1} = (I - \rho_k y_k s^{\top}_k)B_k(I - \rho_k s_k y^{\top}_k) + \rho_k y_k y^{\top}_k, $$where $\rho_k = \dfrac{1}{y^{\top}_k s_k}$,
or, using the Sherman-Morrison-Woodbury formula,
$$ B^{-1}_{k+1} = H_{k+1} = H_k - \dfrac{H_ky_k y_k^{\top}H_k}{y^{\top}_kH_ky_k} + \dfrac{s_ks^{\top}_k}{y^{\top}_ks_k} $$
Q: what is the rank of the difference between $B_{k+1}$ ($H_{k+1}$) and $B_k$ ($H_k$)?
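A quick numerical check of this update (a sketch on random test data; H, s and y below are not related to the notebook's experiments):

import numpy as np

np.random.seed(0)
n_test = 5
s = np.random.randn(n_test)
y = s + 0.1 * np.random.randn(n_test)        # small perturbation, so y^T s > 0 here
H = np.eye(n_test)
# DFP update of the inverse Hessian approximation
H_next = H - np.outer(H @ y, H @ y) / (y @ H @ y) + np.outer(s, s) / (y @ s)
print(np.allclose(H_next @ y, s))            # secant equation: H_{k+1} y_k = s_k
print(np.linalg.matrix_rank(H_next - H))     # rank of the correction term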
Update equation for BFGS:
$$ H_{k+1} = (I - \rho_k s_ky^{\top}_k)H_k(I - \rho_k y_k s^{\top}_k) + \rho_k s_k s^{\top}_k, $$where $\rho_k = \dfrac{1}{y^{\top}_k s_k}$
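For concreteness, a minimal NumPy sketch of this inverse Hessian update (a standalone helper for illustration, not the liboptpy implementation):

import numpy as np

def bfgs_update(H, s, y):
    # H_{k+1} = (I - rho_k s_k y_k^T) H_k (I - rho_k y_k s_k^T) + rho_k s_k s_k^T
    rho = 1.0 / y.dot(s)
    V = np.eye(s.shape[0]) - rho * np.outer(s, y)
    # The result satisfies the secant equation: bfgs_update(H, s, y) @ y ≈ s
    return V @ H @ V.T + rho * np.outer(s, s)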
Idea
or
$$ h_{k+1} = -\left(I - \frac{s_k y_k^{\top}}{y_k^{\top}s_k}\right)f'(x_{k+1}) = -\hat{H}_{k+1} f'(x_{k+1}) $$satisfies all requirements on the matrix in BFGS, and this expression coincides with the BFGS update of $H_k$ if $H_k = I$, i.e. for $k = 1$ in L-BFGS with $H_0 = I$
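This direction can be computed with vector operations only, without forming an $n \times n$ matrix; a minimal sketch (the helper name is illustrative, not part of liboptpy):

def memoryless_direction(grad_next, s, y):
    # h_{k+1} = -(I - s_k y_k^T / (y_k^T s_k)) f'(x_{k+1})
    #         = -(f'(x_{k+1}) - s_k * (y_k^T f'(x_{k+1})) / (y_k^T s_k))
    return -(grad_next - s * (y.dot(grad_next) / y.dot(s)))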
The first way
Problem
$$ \min_{\beta} \|\beta s_{k-1} - y_{k-1} \|^2_2 $$
Solution
$$ \alpha = \frac{1}{\beta} = \frac{s^{\top}_{k-1} s_{k-1}}{s^{\top}_{k-1} y_{k-1}} $$
The second way
Problem
$$ \min_{\alpha} \| s_{k-1} - \alpha y_{k-1} \|^2_2 $$
Solution
$$ \alpha = \frac{s^{\top}_{k-1} y_{k-1}}{y^{\top}_{k-1} y_{k-1}} $$
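Both are one-dimensional least-squares problems; setting the derivative with respect to the scalar variable to zero gives the solutions above:
$$ \frac{d}{d\beta}\|\beta s_{k-1} - y_{k-1}\|^2_2 = 2\beta s^{\top}_{k-1} s_{k-1} - 2 s^{\top}_{k-1} y_{k-1} = 0 \;\Rightarrow\; \beta = \frac{s^{\top}_{k-1} y_{k-1}}{s^{\top}_{k-1} s_{k-1}}, $$
$$ \frac{d}{d\alpha}\|s_{k-1} - \alpha y_{k-1}\|^2_2 = 2\alpha y^{\top}_{k-1} y_{k-1} - 2 s^{\top}_{k-1} y_{k-1} = 0 \;\Rightarrow\; \alpha = \frac{s^{\top}_{k-1} y_{k-1}}{y^{\top}_{k-1} y_{k-1}}. $$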
The third way is called non-monotone line search: a specific modification of the Armijo rule that takes into account the history of objective function values; for more details, see the 2004 paper in the SIAM Journal on Optimization.
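For illustration, a sketch of one common non-monotone backtracking variant, which accepts a step once the objective is sufficiently below the maximum of the last $M$ stored values (not necessarily the exact rule from the cited paper):

def nonmonotone_armijo(f, x, h, grad, f_history, M=10, alpha0=1.0, rho=0.5, c1=1e-4):
    # Compare against the largest of the last M objective values instead of f(x)
    f_ref = max(f_history[-M:])
    alpha = alpha0
    while f(x + alpha * h) > f_ref + c1 * alpha * grad.dot(h):
        alpha *= rho  # backtrack until the non-monotone sufficient decrease holds
    return alpha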
In [1]:
import numpy as np
import liboptpy.unconstr_solvers as methods
import liboptpy.step_size as ss
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.optimize as scopt
plt.rc("text", usetex=True)
In [2]:
n = 3000
m = 100
x0 = np.zeros(n)
max_iter = 100
tol = 1e-5
A = np.random.rand(m, n) * 10
In [3]:
f = lambda x: -np.sum(np.log(1 - A.dot(x))) - np.sum(np.log(1 - x*x))
grad_f = lambda x: np.sum(A.T / (1 - A.dot(x)), axis=1) + 2 * x / (1 - np.power(x, 2))
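A quick sanity check of the analytic gradient: compare grad_f with a central finite difference along one random direction at the feasible point $x = 0$ (the direction and the step are arbitrary choices):

x_test = np.zeros(n)   # feasible: 1 - A.dot(x) > 0 and 1 - x*x > 0
d = np.random.randn(n)
eps = 1e-6
fd = (f(x_test + eps * d) - f(x_test - eps * d)) / (2 * eps)
print("Finite difference: {}, analytic: {}".format(fd, grad_f(x_test).dot(d)))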
In [4]:
def bb_method(f, gradf, x0, tol=1e-6, maxiter=100, callback=None, alpha_type=1):
    it = 0
    x_prev = x0.copy()
    current_tol = np.linalg.norm(gradf(x_prev))
    alpha = 1e-4  # initial step size before any curvature information is available
    while current_tol > tol and it < maxiter:
        it += 1
        current_grad = gradf(x_prev)
        if it != 1:
            # Barzilai-Borwein step sizes from s = x_k - x_{k-1}, g = f'(x_k) - f'(x_{k-1})
            g = current_grad - prev_grad
            if alpha_type == 1:
                alpha = g.dot(s) / g.dot(g)  # alpha = (s^T y) / (y^T y)
            elif alpha_type == 2:
                alpha = s.dot(s) / g.dot(s)  # alpha = (s^T s) / (s^T y)
        if callback:
            callback(x_prev)
        x_next = x_prev - alpha * current_grad
        current_tol = np.linalg.norm(gradf(x_next))
        prev_grad = current_grad
        s = x_next - x_prev
        x_prev = x_next
    if callback:
        callback(x_prev)
    return x_prev
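For example, the hand-written version can be run on the problem defined above, collecting the iterates through the callback (a usage sketch; as with the library solvers below, the behaviour depends on whether the iterates stay in the domain of $f$):

bb_conv = []
x_bb = bb_method(f, grad_f, x0, tol=tol, maxiter=max_iter,
                 callback=lambda x: bb_conv.append(x), alpha_type=1)
print("Gradient norm at the last iterate = {}".format(np.linalg.norm(grad_f(x_bb))))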
In [5]:
method = {
    "BB 1": methods.fo.BarzilaiBorweinMethod(f, grad_f, init_alpha=1e-4, type=1),
    "BFGS": methods.fo.BFGS(f, grad_f),
    "DFP": methods.fo.DFP(f, grad_f),
    "LBFGS": methods.fo.LBFGS(f, grad_f),
}
In [6]:
for m in method:
    print("\t Method {}".format(m))
    _ = method[m].solve(x0=x0, tol=tol, max_iter=max_iter, disp=True)
print("\t Method BFGS Scipy")
scopt_conv = []
scopt_res = scopt.minimize(f, x0, method="BFGS", jac=grad_f, callback=lambda x: scopt_conv.append(x),
                           tol=tol, options={"maxiter": max_iter})
print("Result: {}".format(scopt_res.message))
if scopt_res.success:
    print("Convergence in {} iterations".format(scopt_res.nit))
    print("Function value = {}".format(f(scopt_res.x)))
In [7]:
plt.figure(figsize=(8, 6))
for m in method:
    plt.semilogy([np.linalg.norm(grad_f(x)) for x in method[m].get_convergence()], label=m)
plt.semilogy([np.linalg.norm(grad_f(x)) for x in [x0] + scopt_conv], label="BFGS SciPy")
plt.ylabel(r"$\|f'(x_k)\|_2$", fontsize=18)
plt.xlabel("Number of iterations, $k$", fontsize=18)
plt.legend(fontsize=18)
plt.xticks(fontsize=18)
_ = plt.yticks(fontsize=18)
In [9]:
for m in method:
    print("\t Method {}".format(m))
    %timeit method[m].solve(x0=x0, tol=tol, max_iter=max_iter)
%timeit scopt.minimize(f, x0, method="BFGS", jac=grad_f, tol=tol, options={"maxiter": max_iter})
In [8]:
n = 50
D = np.arange(1, n+1)
U = np.random.randn(n, n)
U, _ = np.linalg.qr(U)
A = U.dot(np.diag(D)).dot(U.T)
b = np.random.randn(n)
eig_vals = np.linalg.eigvals(A)
print("Condition number = {}".format(np.max(eig_vals) / np.min(eig_vals)))
In [9]:
f = lambda x: 0.5 * x.T.dot(A.dot(x)) - b.dot(x)
gradf = lambda x: A.dot(x) - b
x0 = np.random.randn(n)
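For this quadratic objective, exact line search along a direction $h_k$ has a closed form, which is presumably what the ExactLineSearch4Quad(A, b) rule used below computes:
$$ \alpha_k = \arg\min_{\alpha} f(x_k + \alpha h_k) = -\frac{h_k^{\top} f'(x_k)}{h_k^{\top} A h_k} = -\frac{h_k^{\top}(A x_k - b)}{h_k^{\top} A h_k}. $$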
In [21]:
method = {
    "BB 1": methods.fo.BarzilaiBorweinMethod(f, gradf, init_alpha=1e-4, type=1),
    "BB 2": methods.fo.BarzilaiBorweinMethod(f, gradf, init_alpha=1e-4, type=2),
    "BFGS": methods.fo.BFGS(f, gradf),
    "DFP": methods.fo.DFP(f, gradf),
    "GD": methods.fo.GradientDescent(f, gradf, ss.ExactLineSearch4Quad(A, b)),
    "LBFGS": methods.fo.LBFGS(f, gradf, hist_size=10),
}
In [22]:
for m in method:
    print("\t Method {}".format(m))
    _ = method[m].solve(x0=x0, tol=tol, max_iter=max_iter, disp=True)
print("\t Method BFGS Scipy")
scopt_conv = []
scopt_res = scopt.minimize(f, x0, method="BFGS", jac=gradf, callback=lambda x: scopt_conv.append(x),
                           tol=tol, options={"maxiter": max_iter})
print("Result: {}".format(scopt_res.message))
if scopt_res.success:
    print("Convergence in {} iterations".format(scopt_res.nit))
    print("Function value = {}".format(f(scopt_res.x)))
In [23]:
plt.figure(figsize=(12, 8))
fontsize = 26
for m in method:
    plt.semilogy([np.linalg.norm(gradf(x)) for x in method[m].get_convergence()], label=m)
plt.semilogy([np.linalg.norm(gradf(x)) for x in [x0] + scopt_conv], label="BFGS SciPy")
plt.legend(fontsize=fontsize)
plt.ylabel(r"$\|f'(x_k)\|_2$", fontsize=fontsize)
plt.xlabel("Number of iterations, $k$", fontsize=fontsize)
plt.xticks(fontsize=fontsize)
_ = plt.yticks(fontsize=fontsize)
Pro:
Contra: