Consider the problem
$$ \min\limits_{x \in \mathbb{R}^n} f(x). $$
Newton's method minimizes the second-order approximation of $f$ around the current point $x$:
$$ f(x + h) \approx f(x) + \langle f'(x), h \rangle + \frac{1}{2} h^{\top} f''(x) h \to \min_{h}. $$
From the necessary optimality condition for this quadratic model it follows that
$$ f'(x) + f''(x) h = 0, \qquad h^* = -(f''(x))^{-1} f'(x) $$
def NewtonMethod(f, x0, epsilon, **kwargs):
    x = x0
    while True:
        h = ComputeNewtonStep(x, f, **kwargs)
        if StopCriterion(x, f, h, **kwargs) < epsilon:
            break
        alpha = SelectStepSize(x, h, f, **kwargs)
        x = x + alpha * h
    return x
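The helper routines above are placeholders. Below is a minimal sketch of one possible realization (the implementations, the keyword arguments `grad_f` and `hess_f`, and the fixed unit step are assumptions made for illustration, not part of the lecture): the Newton step is obtained by solving the linear system $f''(x) h = -f'(x)$, and the stopping criterion uses the gradient norm.

```python
import numpy as np

# Hypothetical implementations of the placeholders used in the pseudocode.
def ComputeNewtonStep(x, f, grad_f=None, hess_f=None, **kwargs):
    # Solve f''(x) h = -f'(x) instead of forming the inverse Hessian explicitly
    return np.linalg.solve(hess_f(x), -grad_f(x))

def StopCriterion(x, f, h, grad_f=None, **kwargs):
    # Norm of the gradient at the current point
    return np.linalg.norm(grad_f(x))

def SelectStepSize(x, h, f, **kwargs):
    # Pure Newton step; a damped version would run a line search here
    return 1.0
```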
Theorem. Assume that $f(x)$ is twice differentiable, its Hessian is Lipschitz continuous with constant $M$, $f''(x^*) \succeq l I_n$ at the solution $x^*$, and the starting point $x_0$ is sufficiently close to $x^*$.
Then Newton's method converges quadratically:
$$ \|x_{k+1} - x^* \|_2 \leq \dfrac{M\|x_k - x^*\|^2_2}{2 (l - M\|x_k - x^*\|_2)} $$
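To see the quadratic rate numerically, one can run the pure Newton iteration on a toy problem and watch the error. The test function below, $f(x) = e^x - x$ with minimizer $x^* = 0$, is an illustrative choice, not taken from the lecture.

```python
import numpy as np

# Pure Newton iteration on f(x) = exp(x) - x, whose minimizer is x* = 0.
# The error is roughly squared at every step: the number of correct digits doubles.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.exp(x)

xk = 1.0  # starting point close enough to x*
for k in range(6):
    print("k = {}, |x_k - x*| = {:.2e}".format(k, abs(xk)))
    xk = xk - grad(xk) / hess(xk)  # Newton step
```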
Consider a function $f(x)$ and a non-singular transformation with matrix $A$.
Let us check how the Newton direction changes under the transformation $A$.
Let $x = Ay$ and $g(y) = f(Ay)$. Then
$$ g(y + u) \approx g(y) + \langle g'(y), u \rangle + \frac{1}{2} u^{\top} g''(y) u \to \min_{u} $$
and
$$ u^* = -(g''(y))^{-1} g'(y), \qquad y_{k+1} = y_k - (g''(y_k))^{-1} g'(y_k). $$
By the chain rule $g'(y) = A^{\top} f'(Ay)$ and $g''(y) = A^{\top} f''(Ay) A$, so $(g''(y))^{-1} g'(y) = A^{-1} (f''(Ay))^{-1} f'(Ay)$. Thus,
$$ Ay_{k+1} = Ay_k - (f''(Ay_k))^{-1}f'(Ay_k), \qquad \text{i.e.} \quad x_{k+1} = x_k - (f''(x_k))^{-1}f'(x_k). $$
Therefore, the direction given by Newton's method transforms in the same way as the coordinates: the method is affine invariant. The check below illustrates this numerically.
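A small numerical verification of this fact, on a hypothetical strictly convex quadratic and a random non-singular matrix (the setup is illustrative, not from the lecture): one Newton step on $g(y) = f(Ay)$, mapped back by $A$, coincides with one Newton step on $f$.

```python
import numpy as np

np.random.seed(0)
dim = 5
Q = np.random.rand(dim, dim)
Q = Q @ Q.T + dim * np.eye(dim)          # SPD matrix => f is strictly convex
b = np.random.rand(dim)

gf = lambda x: Q @ x + b                 # gradient of f(x) = 0.5 x^T Q x + b^T x
hf = lambda x: Q                         # Hessian of f

A_tr = np.random.rand(dim, dim) + dim * np.eye(dim)   # non-singular transformation
gg = lambda y: A_tr.T @ gf(A_tr @ y)                   # chain rule: g'(y) = A^T f'(Ay)
hg = lambda y: A_tr.T @ hf(A_tr @ y) @ A_tr            # g''(y) = A^T f''(Ay) A

y0 = np.random.rand(dim)
x0_demo = A_tr @ y0
x1 = x0_demo - np.linalg.solve(hf(x0_demo), gf(x0_demo))  # Newton step in x
y1 = y0 - np.linalg.solve(hg(y0), gg(y0))                 # Newton step in y
print(np.allclose(x1, A_tr @ y1))        # True up to round-off
```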
If $f''(x)$ is not positive definite, the Newton direction may not be a descent direction, so the Hessian is replaced with a corrected matrix $f''(x) + \Delta E$. The matrix $\Delta E$ can be chosen in different ways via the following problem $$ \Delta E = \arg\min \|\Delta E\|, \quad \text{s.t. } f''(x) + \Delta E \succ 0 $$
Since the spectrum $\lambda(f''(x))$ is usually unavailable at every iteration, one can modify the Cholesky factorization algorithm so that it produces a factor of the matrix $f''(x) + \Delta E$ instead of the original matrix $f''(x)$; a simple version of this idea is sketched below.
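The sketch below uses the simplest such modification, $\Delta E = \tau I$ with $\tau$ increased until the factorization succeeds (Cholesky with an added multiple of the identity); the function name, the parameters `beta` and `max_tries`, and the update rule for $\tau$ are illustrative choices, not the full modified Cholesky factorization.

```python
import numpy as np

# Cholesky with an added multiple of the identity: try to factor H + tau*I,
# increasing tau until the factorization succeeds.
def cholesky_with_shift(H, beta=1e-3, max_tries=100):
    min_diag = np.min(np.diag(H))
    tau = 0.0 if min_diag > 0 else -min_diag + beta  # initial shift
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            return L, tau
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)  # enlarge the shift and retry
    raise RuntimeError("no positive definite shift found")

# Usage: factor an indefinite matrix
H = np.diag([1.0, -2.0, 3.0])
L, tau = cholesky_with_shift(H)
print(tau, np.allclose(L @ L.T, H + tau * np.eye(3)))
```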
Bottlenecks in Newton's method: computing and storing the Hessian and solving the linear system with it ($O(n^3)$) at every iteration.
In [1]:
import numpy as np
USE_COLAB = False
if USE_COLAB:
    !pip install git+https://github.com/amkatrutsa/liboptpy
import liboptpy.unconstr_solvers as methods
import liboptpy.step_size as ss
n = 1000   # dimension of the variable x
m = 200    # number of log-barrier terms a_i^T x < 1
x0 = np.zeros((n,))
A = np.random.rand(n, m) * 10
In [2]:
import cvxpy as cvx
# Solve the same problem with CVXPY to get a reference optimal value
x = cvx.Variable((n, 1))
obj = cvx.Minimize(cvx.sum(-cvx.log(1 - A.T @ x)) -
                   cvx.sum(cvx.log(1 - cvx.square(x))))
prob = cvx.Problem(obj)
prob.solve(solver="SCS", verbose=True, max_iters=1000)
print("Optimal value =", prob.value)
In [3]:
# Objective f(x) = -sum_i log(1 - a_i^T x) - sum_j log(1 - x_j^2)
f = lambda x: -np.sum(np.log(1 - A.T.dot(x))) - np.sum(np.log(1 - x*x))
# Gradient: sum_i a_i / (1 - a_i^T x) + 2x / (1 - x^2) (elementwise)
grad_f = lambda x: np.sum(A.dot(np.diagflat(1 / (1 - A.T.dot(x)))), axis=1) + 2 * x / (1 - np.power(x, 2))
# Hessian: A diag(1 / (1 - A^T x)^2) A^T + diag(2 (1 + x^2) / (1 - x^2)^2)
hess_f = lambda x: (A.dot(np.diagflat(1 / (1 - A.T.dot(x))**2))).dot(A.T) + np.diagflat(2 * (1 + x**2) / (1 - x**2)**2)
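As a quick sanity check (an addition here, not part of the original notebook), the analytic gradient can be compared against a finite-difference approximation at the starting point; the returned value is the norm of the difference and should be small relative to `np.linalg.norm(grad_f(x0))`.

```python
from scipy.optimize import check_grad

# Norm of the difference between grad_f(x0) and a finite-difference gradient of f at x0
print(check_grad(f, grad_f, x0))
```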
In [4]:
def Newton(f, gradf, hessf, x0, epsilon, num_iter, line_search,
           disp=False, callback=None, **kwargs):
    x = x0.copy()
    iteration = 0
    opt_arg = {"f": f, "grad_f": gradf}
    for key in kwargs:
        opt_arg[key] = kwargs[key]
    while True:
        gradient = gradf(x)
        hess = hessf(x)
        # Newton direction: solve f''(x) h = -f'(x)
        h = np.linalg.solve(hess, -gradient)
        alpha = line_search(x, h, **opt_arg)
        x = x + alpha * h
        if callback is not None:
            callback(x)
        iteration += 1
        if disp:
            print("Current function val =", f(x))
            print("Current gradient norm = ", np.linalg.norm(gradf(x)))
        if np.linalg.norm(gradf(x)) < epsilon:
            break
        if iteration >= num_iter:
            break
    res = {"x": x, "num_iter": iteration, "tol": np.linalg.norm(gradf(x))}
    return res
In [5]:
newton = methods.so.NewtonMethod(f, grad_f, hess_f, ss.Backtracking("Armijo", rho=0.9, beta=0.1, init_alpha=1.))
x_newton = newton.solve(x0, tol=1e-6, max_iter=50, disp=True)
gd = methods.fo.GradientDescent(f, grad_f, ss.Backtracking("Armijo", rho=0.9, beta=0.1, init_alpha=1.))
x_gd = gd.solve(x0, tol=1e-6, max_iter=50, disp=True)
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
if not USE_COLAB:
    plt.rc("text", usetex=True)
plt.figure(figsize=(12, 8))
# Newton
plt.semilogy([np.linalg.norm(grad_f(x)) for x in newton.get_convergence()], label=r"$\| f'(x_k) \|^{N}_2$")
# Gradient
plt.semilogy([np.linalg.norm(grad_f(x)) for x in gd.get_convergence()], label=r"$\| f'(x_k) \|^{G}_2$")
plt.xlabel(r"Number of iterations, $k$", fontsize=26)
plt.ylabel(r"Convergence rate", fontsize=26)
plt.xticks(fontsize = 24)
plt.yticks(fontsize = 24)
plt.legend(loc="best", fontsize=24)
Out[6]:
In [7]:
%timeit newton.solve(x0, tol=1e-6, max_iter=50)
%timeit gd.solve(x0, tol=1e-6, max_iter=50)
Method | Convergence speed | Complexity per iteration | Affine invariance | Requirements on $f(x)$ |
---|---|---|---|---|
GD | Global linear | $O(n)$ + step size search | No | Differentiable, Lipschitz gradient |
Newton method | Local quadratic | $O(n^3)$ + step size search | Yes | Twice differentiable; Lipschitz and positive definite Hessian |