Seminar 12

Introduction to numerical optimization (Y. E. Nesterov Introduction to convex optimization, ch. 1 $\S$ 1.1)

  1. Review of the spring term topics
  2. Problem statement
  3. General scheme of any optimization method
  4. Comparison of optimization methods
  5. Methods to solve one-dimensional minimization problem

Syllabus of the spring term

Also, see the GitHub course page.

  1. Methods to solve unconstrained optimization problem
    • One-dimensional optimization problem (already today!)
    • Gradient descent
    • Newton method
    • Quasi-Newton methods
    • Conjugate gradient method
    • Optional:
      • Least squares problem
      • Optimal methods and lower bounds
  2. Methods to solve constrained optimization problem
    • Linear programming
    • Projected gradient method and conditional gradient method
    • Barrier method
    • Penalty function methods
    • Augmented Lagrangian method

Organizational

  1. One seminar and one lecture per week
  2. Two problem sets
  3. Midterm in the middle of the term
  4. Final test at the end of the term
  5. Oral exam at the end of the term (grading for the semester is similar to the fall term)
  6. Minitests at the beginning of every class
  7. Homework assignment almost every week: $\TeX$ or Jupyter Notebook

Problem statement

\begin{equation} \begin{split} & \min_{x \in S} \; f_0(x)\\ \text{s.t. } & f_j(x) = 0, \; j = 1,\ldots,m\\ & g_k(x) \leq 0, \; k = 1,\ldots,p \end{split} \end{equation}

where $S \subseteq \mathbb{R}^n$, $f_j: S \rightarrow \mathbb{R}, \; j = 0,\ldots,m$, $g_k: S \rightarrow \mathbb{R}, \; k=1,\ldots,p$

All functions are at least continuous.
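For example, a small illustrative instance of this general form (not from the original text) is

$$ \begin{split} & \min_{x \in \mathbb{R}^2} \; x_1^2 + x_2^2\\ \text{s.t. } & x_1 + x_2 - 1 = 0\\ & -x_1 \leq 0, \end{split} $$

where $f_0(x) = x_1^2 + x_2^2$, $f_1(x) = x_1 + x_2 - 1$, $g_1(x) = -x_1$ and $S = \mathbb{R}^2$.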

Important fact: nonlinear optimization problem in its general form is

numerically intractable!

Analytical results

  • First order necessary condition:

if $x^*$ is a local minimum point of the differentiable function $f(x)$, then $$ f'(x^*) = 0 $$

  • Second order necessary condition:

if $x^*$ is a local minimum point of the twice differentiable function $f(x)$, then

$$ f'(x^*) = 0 \quad \text{and} \quad f''(x^*) \succeq 0 $$
  • Sufficient condition:

Assume $f(x)$ is a twice differentiable function and $x^*$ satisfies the following conditions

$$ f'(x^*) = 0 \quad \text{and} \quad f''(x^*) \succ 0, $$

then $x^*$ is a strict local minimum point of the function $f(x)$.

Remark: check that you can prove these claims!
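As a quick numerical illustration (not part of the original notes; the function and the candidate point are chosen here just for the example), the snippet below checks the first-order condition and the second-order sufficient condition for a simple quadratic:

import numpy as np

# Illustrative example: f(x) = (x_1 - 1)^2 + 2 x_2^2 with candidate point x* = (1, 0)
grad = lambda x: np.array([2 * (x[0] - 1), 4 * x[1]])      # f'(x)
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])        # f''(x)

x_star = np.array([1.0, 0.0])
print(np.allclose(grad(x_star), 0))                        # first order: f'(x*) = 0
print(np.all(np.linalg.eigvalsh(hess(x_star)) > 0))        # second order: f''(x*) > 0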

Features of numerical solutions

  1. Exact solution of the given problem is impossible due to precision of machine arithmetic
  2. It is necessary to define the way to check if current point is a solution or not
  3. It is necessary to define what information about the problem is stored

General iterative scheme

Given: initial guess $x$, required tolerance $\varepsilon$.

def GeneralScheme(x, epsilon):

    while StopCriterion(x) > epsilon:

        OracleResponse = RequestOracle(x)

        UpdateInformation(I, x, OracleResponse)

        x = NextPoint(I, x)

    return x
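As a concrete (and purely illustrative) instance of this scheme, gradient descent with a fixed step, which is covered later in the term, can be sketched as follows: the oracle returns the gradient, the stored information is just the current point, and the stopping criterion is the norm of the gradient.

import numpy as np

def gradient_descent(grad, x, step_size=0.1, epsilon=1e-6, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)                      # oracle response at the current point
        if np.linalg.norm(g) < epsilon:  # stopping criterion
            break
        x = x - step_size * g            # next point
    return x

# Usage: minimize f(x) = ||x||_2^2 / 2, whose gradient is f'(x) = x
x_opt = gradient_descent(lambda x: x, np.array([1.0, -2.0]))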

Questions

  1. What are the possible stopping criteria?
  2. What is an oracle and what is it for?
  3. What is information model?
  4. How to get next point?

Stopping criteria

  1. Convergence in $x$: $$ \| x_k - x^* \|_2 < \varepsilon $$
  2. Convergence in $f$: $$ | f(x_k) - f^* | < \varepsilon $$
  3. Necessary condition $$ \| f'(x_k) \|_2 < \varepsilon $$

But we don't know $x^*$! Instead, one can monitor the computable quantity $\|x_{k+1} - x_k\|$:

$$ \|x_{k+1} - x_k \| = \|x_{k+1} - x^* + x^* - x_k \| \leq \|x_{k+1} - x^* \| + \| x_k - x^* \| \leq 2\varepsilon, $$

so once both iterates are $\varepsilon$-close to $x^*$, the step length is at most $2\varepsilon$.

The same holds for convergence in $f$, but sometimes $f^*$ can be estimated!

Remark: a better practice is to use the relative difference in the argument and in the function value, for example $\dfrac{\|x_{k+1} - x_k \|_2}{\| x_k \|_2}$
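A minimal sketch of such a relative criterion (the tolerance and the safeguard constant below are assumptions for the example, not values from the notes):

import numpy as np

def relative_step_criterion(x_new, x_old, tol=1e-6):
    # Relative change of the argument; the small constant guards against x_old = 0
    return np.linalg.norm(x_new - x_old) / max(np.linalg.norm(x_old), 1e-15) < tol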

What is oracle?

Definition: an oracle is an abstract machine that responds to the sequential requests of a method

OOP analogy:

  • the oracle is a virtual method of the base class
  • every problem is a derived class
  • the oracle is defined for every particular problem according to the declaration in the base class

Black box concept

  1. Iterative method can use only oracle responses
  2. Oracle responses are local

Information about the problem

  1. Every oracle response gives local information about the function behaviour at the given point
  2. By aggregating the oracle responses, we update global information about the objective function:
    • curvature
    • descent direction
    • etc

Compute next point

$$ x_{k+1} = x_{k} + \alpha_k h_k $$
  • Line search: fix the direction $h_k$ and search along it for the optimal value of $\alpha_k$
  • Trust region method: fix an appropriate region size in some norm, $\| \cdot \| \leq \alpha$, and a model of the objective function that approximates it well inside this region. Then search for a direction $h_k$ that minimizes the chosen model and does not lead the point $x_k + h_k$ outside the considered region

Questions:

  1. How to choose $\alpha_k$?
  2. How to choose $h_k$?
  3. How to choose model?
  4. How to choose region?
  5. How to choose region size?

In this course we consider only line search methods!

However, the concept of trust region methods will sometimes be helpful.
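To make the line-search idea concrete, here is a minimal backtracking (Armijo-type) sketch; it is only an illustration under assumed default constants, not the method used later in the course:

import numpy as np

def backtracking(f, grad_fx, x, h, alpha=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until the sufficient decrease (Armijo) condition holds
    fx = f(x)
    while f(x + alpha * h) > fx + c * alpha * grad_fx.dot(h):
        alpha *= rho
    return alpha

# Usage: one step for f(x) = ||x||_2^2 / 2 along the antigradient direction
f = lambda x: 0.5 * x.dot(x)
x = np.array([1.0, -2.0])
g = x                          # gradient of f at x
alpha = backtracking(f, g, x, -g)
x_next = x + alpha * (-g)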

How to compare optimization methods?

For a given class of problems one can compare the following quantities:

  1. Complexity
    • analytical: the number of oracle requests needed to solve the problem with accuracy $\varepsilon$
    • arithmetic: the total number of computations needed to solve the problem with accuracy $\varepsilon$
  2. Convergence speed
  3. Experiments

Convergence speed

  1. Sublinear $$ \| x_{k+1} - x^* \|_2 \leq C k^{\alpha}, $$ where $\alpha < 0$ and $ 0 < C < \infty$
  2. Linear (geometric progression) $$ \| x_{k+1} - x^* \|_2 \leq Cq^k, $$ where $q \in (0, 1)$ and $ 0 < C < \infty$

  3. Superlinear $$ \| x_{k+1} - x^* \|_2 \leq Cq^{k^p}, $$ where $q \in (0, 1)$, $ 0 < C < \infty$ and $p > 1$

  4. Quadratic $$ \| x_{k+1} - x^* \|_2 \leq C\| x_k - x^* \|^2_2, \qquad \text{or} \qquad \| x_{k+1} - x^* \|_2 \leq C q^{2^k} $$ where $q \in (0, 1)$ and $ 0 < C < \infty$

Optimal methods: can we do better?

  • Lower bounds on the convergence speed are proved for a given class of problems and methods of a given order
  • Next, methods for which these lower bounds are tight are proposed $\Rightarrow$ optimality is proved
  • More about convergence theorems later

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

USE_COLAB = False
if not USE_COLAB:
    plt.rc("text", usetex=True)
    
import numpy as np
C = 10
alpha = -0.5
q = 0.9
num_iter = 7
sublinear = np.array([C * k**alpha for k in range(1, num_iter + 1)])
linear = np.array([C * q**k for k in range(1, num_iter + 1)])
superlinear = np.array([C * q**(k**2) for k in range(1, num_iter + 1)])
quadratic = np.array([C * q**(2**k) for k in range(1, num_iter + 1)])
plt.figure(figsize=(12,8))
plt.semilogy(np.arange(1, num_iter+1), sublinear, 
             label=r"Sublinear, $\alpha = -0.5$")
plt.semilogy(np.arange(1, num_iter+1), superlinear, 
             label=r"Superlinear, $q = 0.9, p=2$")
plt.semilogy(np.arange(1, num_iter+1), linear, 
             label=r"Linear, $q = 0.9$")
plt.semilogy(np.arange(1, num_iter+1), quadratic, 
             label=r"Quadratic, $q = 0.9$")
plt.xlabel("Number of iteration, $k$", fontsize=24)
plt.ylabel("Error rate upper bound", fontsize=24)
plt.legend(loc="best", fontsize=28)
plt.xticks(fontsize = 28)
_ = plt.yticks(fontsize = 28)


On interpretations of the convergence theorems (B.T. Polyak Introduction to optimization, ch. 1, $\S$ 6)

  1. What answers do convergence theorems give?

    • class of problems for which the method is applicable (it is important to minimize the number of restrictions!)
      • convexity
      • smoothness
    • qualitative behaviour of the method
      • is the initial guess significant for the convergence of the method?
      • what convergence can be observed?
    • estimate of convergence speed
      • theoretical estimate of the method behavior without experiments
      • detection of properties that affect convergence (condition number, dimension, etc)
      • sometimes one can determine in advance the number of iterations needed to achieve the required accuracy
  2. What answers do convergence theorems not give?

    • convergence of the method does not by itself mean that the method should be used
    • convergence estimates often depend on unknown constants, so they are mainly of theoretical interest
    • theorems do not take into account rounding errors or the accuracy of solving auxiliary problems

Main point: it is necessary to use common sense and be careful when using convergence theorems!

Classes of problems

  1. Unconstrained optimization
    • Lipschitz objective function
    • Lipschitz gradient of the objective function
  2. Constrained optimization
    • polytope
    • sets with simple structure
    • general form

Classes of methods

  1. Zero-order methods: the oracle returns only the objective function value $f(x)$

  2. First-order methods: the oracle returns the objective function $f(x)$ and its gradient $f'(x)$

  3. Second-order methods: the oracle returns the objective function $f(x)$, its gradient $f'(x)$ and its Hessian $f''(x)$.

Q: do methods of higher order exist?

  1. One-step methods $$ x_{k+1} = \Phi(x_k) $$
  2. Multi-step methods $$ x_{k+1} = \Phi(x_k, x_{k-1}, ...) $$
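For instance (an illustration only, not from the original notes), plain gradient descent is a one-step method, while Polyak's heavy-ball method $x_{k+1} = x_k - \alpha f'(x_k) + \beta (x_k - x_{k-1})$ is a two-step method:

import numpy as np

def heavy_ball(grad, x0, alpha=0.1, beta=0.5, num_iter=100):
    # Two-step method: the update uses both x_k and x_{k-1}
    x_prev, x = x0, x0
    for _ in range(num_iter):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Usage: minimize f(x) = ||x||_2^2 / 2, whose gradient is f'(x) = x
x_opt = heavy_ball(lambda x: x, np.array([1.0, -2.0]))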

One-dimensional optimization

Definition. A function $f(x)$ is unimodal on the interval $[a, b]$ if there exists a point $x^* \in [a, b]$ such that

  • $f(x_1) > f(x_2)$ for any $a \leq x_1 < x_2 < x^*$,

and

  • $f(x_1) < f(x_2)$ for any $x^* < x_1 < x_2 \leq b$.

Q: what is the geometry of a unimodal function?

Bisection method

Idea from the first-term CS course: repeatedly divide the given interval $[a,b]$ into two equal parts until the minimum of the unimodal function is localized

Denote by $N$ the number of computations of the function $f$; then one can perform $K = \frac{N - 1}{2}$ iterations and the following estimate holds: $$ |x_{K+1} - x^*| \leq \frac{b_{K+1} - a_{K+1}}{2} = \left( \frac{1}{2} \right)^{\frac{N-1}{2}} (b - a) \approx 0.5^{K} (b - a) $$


In [2]:
def binary_search(f, a, b, epsilon, callback=None):
    c = (a + b) / 2.0
    while abs(b - a) > epsilon:
#         Check left subsegment
        y = (a + c) / 2.0
        if f(y) <= f(c):
            b = c
            c = y
        else:
#         Check right subsegment
            z = (b + c) / 2.0
            if f(c) <= f(z):
                a = y
                b = z
            else:
                a = c
                c = z
        if callback is not None:
            callback(a, b)
    return c

In [3]:
def my_callback(a, b, left_bound, right_bound, approximation):
    left_bound.append(a)
    right_bound.append(b)
    approximation.append((a + b) / 2.0)

In [4]:
# %matplotlib inline
import numpy as np
# import matplotlib.pyplot as plt

left_boud_bs = []
right_bound_bs = []
approximation_bs = []

callback_bs = lambda a, b: my_callback(a, b, 
            left_boud_bs, right_bound_bs, approximation_bs)

# Target unimodal function on given segment
f = lambda x: (x - 2) * x * (x + 2)**2 # np.power(x+2, 2)
# f = lambda x: -np.sin(x)
x_true = -2
# x_true = np.pi / 2.0
a = -3
b = -1.5
epsilon = 1e-8
x_opt = binary_search(f, a, b, epsilon, callback_bs)
print(np.abs(x_opt - x_true))
plt.figure(figsize=(10,6))
plt.plot(np.linspace(a,b), f(np.linspace(a,b)))
plt.title("Objective function", fontsize=20)
plt.xticks(fontsize=20)
_ = plt.yticks(fontsize=20)


9.313225746154785e-10

Golden search method

Idea: divide the interval $[a,b]$ not into two equal parts, but in the golden ratio.

The convergence speed estimate is analogous to the bisection method: $$ |x_{K+1} - x^*| \leq b_{K+1} - a_{K+1} = \left( \frac{1}{\tau} \right)^{N-1} (b - a) \approx 0.618^K(b-a), $$ where $\tau = \frac{\sqrt{5} + 1}{2}$.

  • The linear convergence constant is larger than the corresponding constant of the bisection method
  • The number of function calls per iteration is smaller than for the bisection method (function values can be reused between iterations)

In [5]:
def golden_search(f, a, b, tol=1e-5, callback=None):
    tau = (np.sqrt(5) + 1) / 2.0
    y = a + (b - a) / tau**2
    z = a + (b - a) / tau
    while b - a > tol:
        if f(y) <= f(z):
            b = z
            z = y
            y = a + (b - a) / tau**2
        else:
            a = y
            y = z
            z = a + (b - a) / tau
        if callback is not None:
            callback(a, b)
    return (a + b) / 2.0

In [6]:
left_boud_gs = []
right_bound_gs = []
approximation_gs = []

cb_gs = lambda a, b: my_callback(a, b, left_boud_gs, right_bound_gs, approximation_gs)
x_gs = golden_search(f, a, b, epsilon, cb_gs)

print(f(x_opt))
print(f(x_gs))
print(np.abs(x_opt - x_true))


6.93889390875399e-18
9.549014390504221e-18
9.313225746154785e-10

Comparison of the methods for one-dimensional problem


In [7]:
plt.figure(figsize=(10,6))
plt.semilogy(np.arange(1, len(approximation_bs) + 1), np.abs(x_true - np.array(approximation_bs, dtype=np.float64)), label="Binary search")
plt.semilogy(np.arange(1, len(approximation_gs) + 1), np.abs(x_true - np.array(approximation_gs, dtype=np.float64)), label="Golden search")
plt.xlabel(r"Number of iterations, $k$", fontsize=24)
plt.ylabel("Error rate upper bound", fontsize=24)
plt.legend(loc="best", fontsize=24)
plt.xticks(fontsize = 24)
_ = plt.yticks(fontsize = 24)



In [8]:
%timeit binary_search(f, a, b, epsilon)
%timeit golden_search(f, a, b, epsilon)


21.1 µs ± 280 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84.1 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Example of different behaviour of the considered methods

$$ f(x) = \sin(\sin(\sin(\sqrt{x}))), \; x \in [2, 60] $$

In [9]:
f = lambda x: np.sin(np.sin(np.sin(np.sqrt(x))))
x_true = (3 * np.pi / 2)**2
a = 2
b = 60
epsilon = 1e-8
plt.plot(np.linspace(a,b), f(np.linspace(a,b)))
plt.xticks(fontsize = 24)
_ = plt.yticks(fontsize = 24)
plt.title("Objective function", fontsize=20)


Out[9]:
Text(0.5,1,'Objective function')

Comparison of convergence speed and execution time

Bisection method


In [10]:
left_boud_bs = []
right_bound_bs = []
approximation_bs = []

callback_bs = lambda a, b: my_callback(a, b, 
            left_boud_bs, right_bound_bs, approximation_bs)

x_opt = binary_search(f, a, b, epsilon, callback_bs)
print(np.abs(x_opt - x_true))


2.1968899233115735e-07

Golden section method


In [11]:
left_boud_gs = []
right_bound_gs = []
approximation_gs = []

cb_gs = lambda a, b: my_callback(a, b, left_boud_gs, right_bound_gs, approximation_gs)
x_gs = golden_search(f, a, b, epsilon, cb_gs)

print(np.abs(x_opt - x_true))


2.1968899233115735e-07

Convergence


In [12]:
plt.figure(figsize=(8,6))
plt.semilogy(np.abs(x_true - np.array(approximation_bs, dtype=np.float64)), label="Binary")
plt.semilogy(np.abs(x_true - np.array(approximation_gs, dtype=np.float64)), label="Golden")
plt.legend(fontsize=24)
plt.xticks(fontsize=24)
_ = plt.yticks(fontsize=24)
plt.xlabel(r"Number of iterations, $k$", fontsize=24)
plt.ylabel("Error rate upper bound", fontsize=24)


Out[12]:
Text(0,0.5,'Error rate upper bound')

Execution time


In [13]:
%timeit binary_search(f, a, b, epsilon)
%timeit golden_search(f, a, b, epsilon)


461 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
436 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Recap

  1. Introduction to numerical optimization
  2. General scheme of any optimization method
  3. How to compare optimization methods
  4. Zoo of the optimization methods and problems
  5. One-dimensional minimization