The next step:
Definition.
A set is called simple if the projection onto it can be computed significantly faster (often analytically) than solving the original problem.
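For instance, the Euclidean ball is simple in this sense: the projection amounts to rescaling the point. A minimal numpy sketch for illustration (not part of the original notes):

import numpy as np

def project_onto_ball(y, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}
    norm_y = np.linalg.norm(y)
    return y if norm_y <= radius else radius * y / norm_y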
Remark.
The definition above uses the projection; it is a particular case of the more general concept of a simple structure set, which uses the proximal mapping. See more details here.
where $A^+$ is the pseudoinverse of $A$. If $A$ has full column rank, then $A^+ = (A^{\top}A)^{-1}A^{\top}$.
where $(\lambda_i, v_i)$ are the eigenvalues and corresponding eigenvectors of the matrix $Y$.
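For example, if the set in question is the cone of symmetric positive semidefinite matrices (an assumption here), the spectral projection simply clips the negative eigenvalues of $Y$:

import numpy as np

def project_onto_psd_cone(Y):
    # keep the eigenvectors, clip negative eigenvalues to zero
    lam, V = np.linalg.eigh(Y)   # eigenpairs (lambda_i, v_i) of the symmetric matrix Y
    return (V * np.maximum(lam, 0)) @ V.T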
Idea: take a gradient descent step and project the resulting point onto the feasible set $P$.
where $x_k(\alpha) = \pi_P (x_k - \alpha f'(x_k))$
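A minimal from-scratch sketch of this iteration with a fixed step size (the experiments below use the ProjectedGD class from liboptpy instead):

import numpy as np

def projected_gradient_descent(grad, proj, x0, alpha, max_iter=1000, tol=1e-8):
    # iterate x_{k+1} = pi_P(x_k - alpha * f'(x_k))
    x = x0.copy()
    for _ in range(max_iter):
        x_next = proj(x - alpha * grad(x))
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x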
Theorem. Let $f$ be a convex differentiable function whose gradient is Lipschitz continuous with constant $L$, let the feasible set $P$ be convex and closed, and let $0 < \alpha < 2 / L$.
Then
Pro
Contra
Definition. A set $D$ is said to have a simple LMO if the following problem
$$ \min_{x \in D} c^{\top}x $$
can be solved significantly faster than the original problem.
Here LMO stands for linear minimization oracle, which returns a solution of the above problem.
Remark 1: this definition differs from the previous one in that the objective function is linear rather than quadratic. Therefore, there are many more sets with a simple LMO than simple structure sets.
Remark 2: sometimes the projection is easy to compute, but the optimal objective value of the LMO problem is $-\infty$. For example, consider the set
$$ D = \{ x \in \mathbb{R}^n \; | \; x_i \geq 0 \}: $$
the projection onto this set is easy to compute, but the optimal objective value of the linear programming problem equals $-\infty$ whenever the vector $c$ has at least one negative entry. The theorem below explains this phenomenon.
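As an illustration of a simple LMO on a bounded set, consider the unit $\ell_1$ ball: a linear function attains its minimum at a signed vertex. A minimal sketch (illustrative; the linsolver functions used in the experiments below handle different sets):

import numpy as np

def lmo_l1_ball(c, radius=1.0):
    # solve min_{||x||_1 <= radius} <c, x>: put all weight on the largest |c_i|
    x = np.zeros_like(c, dtype=float)
    i = np.argmax(np.abs(c))
    x[i] = -radius * np.sign(c[i])
    return x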
Idea: move not along the negative gradient, but along a direction that leads to a feasible point.
Analogy with gradient descent: minimize the linear approximation over the feasible set:
$$ f(x_k + s_k) \approx f(x_k) + \langle f'(x_k), s_k \rangle \to \min_{{\color{red}{s_k \in D}}} $$
The line search should start with $\alpha_k = 1$.
Therefore, by convexity of $f$ and since $x^* \in D$,
$$ f(x^*) \geq f(x) + \min_{s \in D} \langle f'(x), s - x\rangle $$or
$$ f(x) - f(x^*) \leq -\min_{s \in D} \langle f'(x), s - x\rangle = \max_{s \in D} \langle f'(x), x - s\rangle = g(x) $$
We obtain an analogue of the duality gap, which can be used to control the accuracy and stability of the solution.
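A minimal sketch of the resulting Frank-Wolfe iteration, which also tracks the gap $g(x_k)$ as a stopping criterion (the experiments below use the FrankWolfe class from liboptpy instead):

import numpy as np

def frank_wolfe(grad, lmo, x0, max_iter=1000, tol=1e-8):
    x = x0.copy()
    for k in range(max_iter):
        g = grad(x)
        s = lmo(g)                 # s_k = argmin_{s in D} <f'(x_k), s>
        gap = g.dot(x - s)         # g(x_k) >= f(x_k) - f(x*)
        if gap < tol:
            break
        alpha = 2.0 / (k + 2)      # standard step size schedule
        x = x + alpha * (s - x)
    return x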
Theorem 4.2.1. Let $X$ be a convex compact set and let $f(x)$ be a differentiable function on $X$ with Lipschitz gradient. The step size is chosen according to the Armijo rule. Then for any ${\color{red}{x_0 \in X}}$
Theorem (primal) (Convex Optimization: Algorithms and Complexity, Th. 3.8). Let $f$ be a convex differentiable function whose gradient is Lipschitz with constant $L$, and let the set $X$ be convex and compact with diameter $d > 0$. Then the Frank-Wolfe method with step size $\alpha_k = \frac{2}{k + 1}$ converges as
$$ f(x_k) - f(x^*) \leq \dfrac{2d^2L}{k + 2}, \quad k \geq 1 $$
Theorem (dual) (see this paper). After performing $K$ iterations of the Frank-Wolfe method for a convex smooth function, the following inequality holds for the function $g(x) = \max\limits_{s \in D} \langle x - s, f'(x) \rangle$ and some $k \leq K$
$$ g(x_k) \leq \frac{2\beta C_f}{K+2} (1 + \delta), $$
where $\beta \approx 3$, $\delta$ is the accuracy of solving the intermediate problems, and $C_f$ is the curvature constant of the function $f$ on the set $D$:
$$ C_f = \sup_{x, s \in D; \gamma \in [0,1]} \frac{2}{\gamma^2}\left(f(x + \gamma(s - x)) - f(x) - \langle \gamma(s - x), f'(x)\rangle\right) $$
The expression inside the supremum, without the factor $\frac{2}{\gamma^2}$, is the Bregman divergence of $f$.
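In particular, if the gradient of $f$ is Lipschitz with constant $L$, the curvature constant is bounded via the diameter of the set:
$$ \frac{2}{\gamma^2}\left(f(x + \gamma(s - x)) - f(x) - \langle \gamma(s - x), f'(x)\rangle\right) \leq \frac{2}{\gamma^2} \cdot \frac{L}{2}\|\gamma(s - x)\|^2 = L\|s - x\|^2, $$
so $C_f \leq L \, \mathrm{diam}(D)^2$, which ties the dual bound above to the primal one.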
Definition. The atomic norm is defined as
$$ \|x\|_{\mathcal{D}} = \inf_{t \geq 0} \{ t \; | \; x \in t\mathcal{D} \} $$
It is a norm if $\mathcal{D}$ is symmetric and $0 \in \mathrm{int}(\mathcal{D})$.
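For example, for $\mathcal{D} = \mathrm{conv}\{\pm e_1, \ldots, \pm e_n\}$ (the cross-polytope)
$$ \|x\|_{\mathcal{D}} = \inf_{t \geq 0} \{ t \; | \; x \in t\mathcal{D} \} = \sum_{i=1}^n |x_i| = \|x\|_1, $$
i.e. the $\ell_1$ norm is an atomic norm with atoms $\pm e_i$.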
In [1]:
def func(x, A, b):
    # objective: f(x) = 0.5 * ||Ax - b||_2^2 (A and b are generated in cell In [4] below)
    return 0.5 * np.linalg.norm(A.dot(x) - b)**2

f = lambda x: func(x, A, b)

def grad_f(x, A, b):
    # gradient: f'(x) = A^T (Ax - b)
    grad = -A.T.dot(b)
    grad = grad + A.T.dot(A.dot(x))
    return grad

grad = lambda x: grad_f(x, A, b)
In [2]:
def linsolver(gradient):
    # LMO for the box [0, 1]^n: minimize <gradient, x> over the unit cube,
    # i.e. set x_i = 1 where the gradient is negative and x_i = 0 otherwise
    x = np.zeros(gradient.shape[0])
    x[gradient < 0] = 1
    return x
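A quick check of this LMO on a small gradient vector (not in the original notebook; assumes numpy is already imported):

print(linsolver(np.array([1.0, -2.0, 0.0, -0.5])))  # expected output: [0. 1. 0. 1.]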
In [3]:
def projection(y):
    # Euclidean projection onto the box [0, 1]^n
    return np.clip(y, 0, 1)
In [4]:
import liboptpy.constr_solvers as cs
import liboptpy.step_size as ss
import numpy as np
from tqdm import tqdm

# random least squares data: b = A x_true + noise, where x_true lies in [0, 1]^n
n = 200
m = 100
A = np.random.randn(m, n)
x_true = np.random.rand(n)
b = A.dot(x_true) + 0.01 * np.random.randn(m)
In [5]:
def myplot(x, y, xlab, ylab, xscale="linear", yscale="log"):
    plt.figure(figsize=(10, 8))
    plt.xscale(xscale)
    plt.yscale(yscale)
    for key in y:
        plt.plot(x[key], y[key], label=key)
    plt.xticks(fontsize=24)
    plt.yticks(fontsize=24)
    plt.legend(loc="best", fontsize=24)
    plt.xlabel(xlab, fontsize=24)
    plt.ylabel(ylab, fontsize=24)
In [6]:
x0 = np.random.rand(n)
cg = cs.FrankWolfe(f, grad, linsolver, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
x_cg = cg.solve(x0=x0, max_iter=200, tol=1e-10, disp=1)
print("Optimal value CG =", f(x_cg))
In [7]:
pg = cs.ProjectedGD(f, grad, projection, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
x_pg = pg.solve(x0=x0, max_iter=200, tol=1e-10, disp=1)
print("Optimal value PG =", f(x_pg))
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc("text", usetex=True)
y_hist_f_cg = [f(x) for x in cg.get_convergence()]
y_hist_f_pg = [f(x) for x in pg.get_convergence()]
myplot({"CG": range(1, len(y_hist_f_cg) + 1), "PG": range(1, len(y_hist_f_pg) + 1)},
{"CG": y_hist_f_cg, "PG": y_hist_f_pg}, "Number of iteration",
r"Objective function, $\frac{1}{2}\|Ax - b\|^2_2$")
In [9]:
import cvxpy as cvx
x = cvx.Variable(n)
obj = cvx.Minimize(0.5 * cvx.norm(A * x - b, 2)**2)
constr = [x >= 0, x <= 1]
problem = cvx.Problem(objective=obj, constraints=constr)
value = problem.solve()
x_cvx = np.array(x.value).ravel()
print("CVX optimal value =", value)
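As a quick sanity check (not part of the original notebook), the objective values of the iterative methods can be compared with the CVXPY one:

print("f(x_pg) - f(x_cvx) =", f(x_pg) - f(x_cvx))
print("f(x_cg) - f(x_cvx) =", f(x_cg) - f(x_cvx))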
In [11]:
eps = [10**(-i) for i in range(8)]
time_pg = np.zeros(len(eps))
time_cg = np.zeros(len(eps))
iter_pg = np.zeros(len(eps))
iter_cg = np.zeros(len(eps))
pg = cs.ProjectedGD(f, grad, projection, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
cg = cs.FrankWolfe(f, grad, linsolver, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
for i, tol in tqdm(enumerate(eps)):
    res = %timeit -o -q pg.solve(x0=x0, tol=tol, max_iter=100000)
    time_pg[i] = res.average
    iter_pg[i] = len(pg.get_convergence())
    res = %timeit -o -q cg.solve(x0=x0, tol=tol, max_iter=100000)
    time_cg[i] = res.average
    iter_cg[i] = len(cg.get_convergence())
In [12]:
myplot({"CG":eps, "PG": eps}, {"CG": time_cg, "PG": time_pg}, r"Accuracy, $\varepsilon$", "Time, s", xscale="log")
In [13]:
myplot({"CG":eps, "PG": eps}, {"CG": iter_cg, "PG": iter_pg}, r"Accuracy, $\varepsilon$", "Number of iterations", xscale="log")
In [14]:
def linsolver(gradient):
    # LMO for the set {x >= 0, sum(x) <= 1}: put all weight on the coordinate
    # with the most negative gradient entry, or return zero if none is negative
    x = np.zeros(gradient.shape[0])
    idx_min = np.argmin(gradient)
    if gradient[idx_min] > 0:
        x[idx_min] = 0
    else:
        x[idx_min] = 1
    return x
In [15]:
def projection(y):
    # Euclidean projection onto the set {x >= 0, sum(x) <= 1}
    x = y.copy()
    if np.all(x >= 0) and np.sum(x) <= 1:
        return x
    x = np.clip(x, 0, np.max(x))
    if np.sum(x) <= 1:
        return x
    # otherwise project onto the simplex: find the threshold t_hat
    # such that max(y - t_hat, 0) sums to one
    n = x.shape[0]
    bget = False
    x.sort()
    x = x[::-1]
    temp_sum = 0
    t_hat = 0
    for i in range(n - 1):
        temp_sum += x[i]
        t_hat = (temp_sum - 1.0) / (i + 1)
        if t_hat >= x[i + 1]:
            bget = True
            break
    if not bget:
        t_hat = (temp_sum + x[n - 1] - 1.0) / n
    return np.maximum(y - t_hat, 0)
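A quick feasibility check of this projection on a vector with mixed signs (not in the original notebook):

p = projection(np.array([0.5, -0.3, 0.9, 0.2, -0.1]))
print(p, p.min() >= 0, p.sum() <= 1 + 1e-10)  # expected: roughly [0.3 0. 0.7 0. 0.] True True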
In [30]:
x0 = np.random.rand(n) * 10
x0 = x0 / x0.sum()
cg = cs.FrankWolfe(f, grad, linsolver, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
x_cg = cg.solve(x0=x0, max_iter=200, tol=1e-10)
print("Optimal value CG =", f(x_cg))
In [31]:
pg = cs.ProjectedGD(f, grad, projection, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
x_pg = pg.solve(x0=x0, max_iter=200, tol=1e-10)
print("Optimal value PG =", f(x_pg))
In [32]:
y_hist_f_cg = [f(x) for x in cg.get_convergence()]
y_hist_f_pg = [f(x) for x in pg.get_convergence()]
myplot({"CG": range(1, len(y_hist_f_cg) + 1), "PG": range(1, len(y_hist_f_pg) + 1)},
{"CG": y_hist_f_cg, "PG": y_hist_f_pg}, "Number of iteration",
r"Objective function, $\frac{1}{2}\|Ax - b\|^2_2$")
In [22]:
eps = [10**(-i) for i in range(8)]
time_pg = np.zeros(len(eps))
time_cg = np.zeros(len(eps))
iter_pg = np.zeros(len(eps))
iter_cg = np.zeros(len(eps))
pg = cs.ProjectedGD(f, grad, projection, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
cg = cs.FrankWolfe(f, grad, linsolver, ss.Backtracking(rule_type="Armijo", rho=0.5, beta=0.1, init_alpha=1.))
for i, tol in tqdm(enumerate(eps)):
    res = %timeit -o -q pg.solve(x0=x0, tol=tol, max_iter=100000)
    time_pg[i] = res.average
    iter_pg[i] = len(pg.get_convergence())
    res = %timeit -o -q cg.solve(x0=x0, tol=tol, max_iter=100000)
    time_cg[i] = res.average
    iter_cg[i] = len(cg.get_convergence())
In [23]:
myplot({"CG":eps, "PG": eps}, {"CG": time_cg, "PG": time_pg},
r"Accuracy, $\varepsilon$", "Time, s", xscale="log")
In [24]:
myplot({"CG":eps, "PG": eps}, {"CG": iter_cg, "PG": iter_pg},
r"Accuracy, $\varepsilon$", "Number of iterations", xscale="log")
In [25]:
x = cvx.Variable(n)
obj = cvx.Minimize(0.5 * cvx.norm2(A * x - b)**2)
constr = [cvx.norm(x, 1) <= 1, x >= 0]
problem = cvx.Problem(objective=obj, constraints=constr)
value = problem.solve()
x_cvx = np.array(x.value).ravel()
print("CVX optimal value =", value)
Pro
Contra