Introduction to non-smooth problems and proximal methods

Non-smooth convex optimization problem

$$ \min_{x \in \mathbb{R}^n} f(x), $$

where $f(x)$ is convex, but non-smooth.

Q: since gradient does not exist, what method can solve

such problem?

Subgradient method

Replace gradient with subgradient
Original book is Minimization methods for non-differentiable functions by Naum Shor
Good review in old-school style is here
Compare the method with gradient descent

Non-descent method

Example

$$ f(x_1, x_2) = |x_1| + 2|x_2| $$

At $x = (1, 0)$ a subgradient is $(1, 2)$
The direction $-g$ is not descent

Any good news?

Let us estimate convergence of the subgradient method
$\|x_{k+1} - x^*\|^2_2 = \|x_k - \alpha_k g_k - x^*\|_2^2 \leq \|x_k - x^*\|^2_2 - 2\alpha_k g^{\top}_k(x_k - x^*) + \alpha^2_k \|g\|_2^2 \leq \|x_k - x^*\|^2_2 - 2\alpha_k (f(x_k) - f^*) + \alpha^2_k \|g\|_2^2$
Use the minimum function value after $k$ iterations $f_{best}^{(k)}$

$$ 2 \left(\sum_{i=1}^k \alpha_i\right) (f^{(k)}_{best} - f^*) \leq \| x_0 - x^*\|_2^2 - \|x_k - x^* \|_2^2 + \sum_{i=1}^k \alpha^2_i \|g_i\|_2^2 \leq \| x_0 - x^*\|_2^2 + \sum_{i=1}^k \alpha^2_i \|g_i\|_2^2 $$

Finally

$$ f^{(k)}_{best} - f^* \leq \frac{\| x_0 - x^*\|_2^2 + \sum_{i=1}^k \alpha^2_i \|g_i\|_2^2}{2\sum_{i=1}^k \alpha_i} $$

Step size selection

Fix step size $\alpha$

$$ f^{(k)}_{best} - f^* \leq \frac{\| x_0 - x^*\|_2^2 + G^2 k \alpha^2}{2 k \alpha} = \frac{\| x_0 - x^*\|_2^2}{2 k \alpha} + \frac{G^2 \alpha}{2}, $$

where $\|g_i\|_2 \leq G$

Scaled gradient norm $\alpha_i = \frac{s}{\|g_{i}\|_2}$

$$ f^{(k)}_{best} - f^* \leq \frac{G\| x_0 - x^*\|_2^2}{2 k s} + \frac{G s}{2}, $$

Diminishing step size: $\alpha_i \to 0, \sum_{i=1}^{\infty}\alpha_i = \infty$

$$ f^{(k)}_{best} - f^* \to 0. \quad k \to \infty $$

Test problem

$$ \|Ax - b\|_1 \to \min_x $$



In [1]:

    
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

USE_COLAB = False
if not USE_COLAB:
    plt.rc("text", usetex=True)

if USE_COLAB:
    !pip install git+https://github.com/amkatrutsa/liboptpy
        
import liboptpy.unconstr_solvers as methods
import liboptpy.step_size as ss


m, n = 500, 100
A = np.random.randn(m, n)
x_true = np.random.randn(n)
b = A.dot(x_true)



In [2]:

    
f = lambda x: np.linalg.norm(A.dot(x) - b, 1)
subgrad = lambda x: A.T.dot(np.sign(A.dot(x) - b))
alpha = 1e-3
s = 1e-1
sg_methods = {
    "SM 1/k": methods.fo.SubgradientMethod(f, subgrad, ss.InvIterStepSize()),
    "SM fixed={}".format(alpha): methods.fo.SubgradientMethod(f, subgrad, ss.ConstantStepSize(alpha)),
    "SM scaled fix, s={}".format(s): methods.fo.SubgradientMethod(f, subgrad, 
                                                  ss.ScaledConstantStepSize(s)),
}



In [3]:

    
x0 = np.random.randn(n)
max_iter = 500



In [4]:

    
for m in sg_methods:
    _ = sg_methods[m].solve(x0=x0, max_iter=max_iter)



In [5]:

    
plt.figure(figsize=(10, 8))
for m in sg_methods:
    plt.semilogy([f(x) for x in sg_methods[m].get_convergence()], label=m)
plt.legend(fontsize=20)
plt.xlabel(r"Number of iterations, $k$", fontsize=26)
plt.ylabel(r"Objective, $f(x_k)$", fontsize=26)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.grid(True)

Summary on subgradient method

Black-box method for non-smooth function
Natural generalization of gradient descent
But it is not descent method :(
Step size selection rules

Gradient descent as forward Euler scheme

Consider the following ODE

$$ \frac{dx}{dt} = -f'(x(t)) $$

and after discretization

$$ \frac{x_{k+1} - x_k}{\alpha} = -f'(x_k), $$

where $\alpha = t_{k+1} - t_k$ is a grid step and $x_k \equiv x(t_k)$

From this follows

$$ x_{k+1} = x_k - \alpha f'(x_k), $$

which is equivalent to gradient descent method

Properties of forward Euler scheme

Easy to use
Can be unstable if $\alpha$ is too large, which is natural drawback of gradient descent with large step size
Stability issue can be solved with backward Euler scheme

Proximal method as backward Euler scheme

$$ \frac{dx}{dt} = -f'(x(t)) $$

and after discretization in backward manner

$$ \frac{x_{k+1} - x_k}{\alpha} = -f'({\color{red}{x_{k+1}}}), $$

After some re-arrangements

\begin{align*} & \left(\frac{1}{2\alpha} \|u - x_k\|_2^2 + f(u) \right)'(x_{k+1}) = 0 \\ & x_{k+1} = \arg\min_u \left(f(u) + \frac{1}{2\alpha} \|u - x_k\|_2^2\right) = prox_{\alpha f}(x_k) \end{align*}

The method to solve positive semi-definite linear system given Cholesky factorization of $A + \epsilon I$ for some $\epsilon$ $$ f(x) = \frac{1}{2}x^{\top}Ax - b^{\top}x \to \min_x, $$ where $A \in \mathbb{S}^n_+$

\begin{align*} prox_{\alpha f} (x_k) & = \arg\min_u \left(\frac{1}{2}u^{\top}Au - b^{\top}u + \frac{1}{2\alpha} \|u - x_k\|_2^2\right) \\ & = \left(A + (1/ \alpha) I\right)^{-1}\left(b + (1 / \alpha)x_k\right) \\ & = x_k + (A + \epsilon I)^{-1}(b - Ax_k) \end{align*}

Properties of proximal map

$$ x_{k+1} = prox_{f}(x_k) $$

Fixed point of proximal operator is equal to minimizer of $f$

$$ x^* = prox_f(x^*) $$

Proximal map is not a contraction

There exists theory on firm non-expansiveness of proximal operator

Resolvent of operator

$$ prox_{\alpha f}(x) = (I + \alpha f')^{-1}(x) $$

This property can be extended to subdifferential.

Modified gradient step interpretation

Assume $f$ is twice differentiable and strong convex ($f''(x) \succ 0$).

Gradient method as approximation of proximal method

$$ x_{k+1} = prox_{\alpha f} (x_k) = (I + \alpha f')^{-1}(x_k) \approx x_k - \alpha f'(x_k) + o(\alpha), \; \alpha \to 0 $$

Proximal method for second-order approximation $\hat{f}(v) = f(x) + \langle f'(x), v - x \rangle + \frac{1}{2}\langle v - x, f''(x) (v - x) \rangle$

$$ prox_{\alpha \hat{f}}(x_k) = x_k - (f''(x_k) + (1 / \alpha)I)^{-1}f'(x_k) $$

Separable sum

If $f$ has the form

$$ f(x) = \sum_{i=1}^n f_i(x_i) $$

then $$ prox_f(v)_i = prox_{f_i}(v_i) $$

High parallelization of the proximal operator computing
Consensus form of the problem $$ \min \sum_{i=1}^n f_i(x) \to \min \sum_{i=1}^n f_i(x_i), \; \text{s.t. } x_1 = \ldots = x_n $$

Proximal gradient method (PGM) - look into black box

Consider convex function $f$ such that

$$ f(x) = h(x) + g(x), $$

where $h(x)$ is convex differentiable, and $g(x)$ can be convex extended-value function, so $g: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$

Then the one step of the proximal gradient method is

$$ x_{k+1} = prox_{\alpha_k g} (x_k - \alpha_k h'(x_k)) $$

Convergence rate $\mathcal{O}(1/k)$ for step size $\alpha_k \equiv \alpha \in (0, 1/L]$, where $L$ is Lipschitz constant of $f'$
Backtracking procedure is available for adaptive search of $\alpha_k$

Pacticular cases of PGM

If $h \equiv 0$, then PGM is simple proximal method
If $g \equiv 0$, then PGM is vanilla gradient descent method

Fixed point iteration interpretation

From the first-order optimality condition follows

\begin{align*} & 0 \in \alpha h'(x^*) + \alpha\partial g(x^*)\\ & 0 \in \alpha h'(x^*) + x^* - x^* + \alpha\partial g(x^*)\\ & (I - \alpha h')(x^*) \in (I + \alpha \partial g)(x^*)\\ & x^* = (I + \alpha \partial g)^{-1}(I - \alpha h')(x^*)\\ & x^* = prox_{\alpha g}(x^* - \alpha h'(x^*)) \end{align*}

Accelerated proximal gradient method

\begin{align*} & y_{k+1} = x_k + w_k (x_k - x_{k+1})\\ & x_{k+1} = prox_{\alpha_k g} (y_{k+1} - \alpha_k h'(y_{k+1})) \end{align*}

$\{w_k\}$ is special sequence that has to be chosen to accelerate convergence, for example $w_k = \frac{k}{k+3}$
Similar backtracking procedure is helpful for adaptive search $\alpha_k$
Convergence rate $\mathcal{O}(1/k^2)$ which is tight lower bound for convex function

Lasso example

$$ \min_x \frac{1}{2}\|Ax - y \|^2_2 + \gamma \|x\|_1 $$

$h(x) = \frac{1}{2}\|Ax - y \|^2_2$
$g(x) = \gamma \|x\|_1$

Soft thresholding

$\|x\|_1 = |x_1| + \ldots + |x_n|$
Separable sum
$prox_{\alpha \|x\|_1} (x)_i = prox_{\alpha |\cdot|}(x_i)$
$prox_{\alpha |\cdot|} (x_i) = \arg\min_u (|u| + 1/(2\alpha)\|x_i - u\|^2_2)$
Analytical solution $$ prox_{\alpha |\cdot|} (x_i) = \begin{cases} x_i - \alpha & x_i \geq \alpha \\ 0 & |x_i| \leq \alpha \\ x_i + \alpha & x_i \leq -\alpha \end{cases} $$
Vectorized form $$ prox_{\alpha \|x\|_1} (x) = \text{sign}(x)(|x| - \alpha)_+, \; |x| = (|x_1|, \ldots, |x_n|) $$

(F)ISTA

Proximal gradient method for Lasso problem is known as ISTA (Iterative shrinkage-thresholding algorithm)
Fast proximal gradient is known as FISTA (Fast ISTA)

Experiments



In [6]:

    
def prox_alg(f, f_grad, g_prox, x0, num_iter, beta=0.5, fix_lam=None, accel=False):
    conv = [x0]
    x = x0.copy()
    if accel:
        t_prev = 1
        t_next = (1 + np.sqrt(1 + 4 * t_prev**2)) / 2.
    if fix_lam is None:
        lam = 1.
    for i in range(num_iter):
        if accel and i > 0:
            x = x + (t_prev - 1) / t_next * (x - conv[-2])
            t_prev = t_next
            t_next = (1 + np.sqrt(1 + 4 * t_prev**2)) / 2.
        if fix_lam is None:
            while True:
                z = g_prox(x - lam * f_grad(x), lam)
                if f(z) <= f(x) + f_grad(x).dot(z - x) + np.linalg.norm(z - x)**2 / (2 * lam):
                    break
                else:
                    lam = lam * beta
        else:
            z = g_prox(x - fix_lam * f_grad(x), fix_lam)
        x = z.copy()
        conv.append(x)
    return x, conv



In [7]:

    
f = lambda x: np.linalg.norm(A.dot(x) - y)**2 / 2
f_grad = lambda x: A.T.dot(A.dot(x) - y)
f_subgrad = lambda x: A.T.dot(A.dot(x) - y) + gamma * np.sign(x)

def g_prox(h, lam):
    return np.sign(h) * np.maximum(np.abs(h) - lam * gamma, 0)



In [8]:

    
import sklearn.preprocessing as skprep
m = 500
n = 2500
A = np.random.rand(m, n)
A = skprep.normalize(A, norm="l2", axis=0)
L = np.linalg.eigvalsh(A.T.dot(A)).max()
x_true = np.random.randn(n)
x_true[np.random.rand(n) < 0.96] = 0
print("Number of nonzeros in x_true = {}".format(np.sum(x_true != 0)))
v = 9e-2 * np.random.randn(m)
y = A.dot(x_true) + v
gamma_max = np.linalg.norm(A.T.dot(y), ord=np.inf)
gamma = 0.1 * gamma_max
print("Gamma = {}".format(gamma))
print("S2N ratio = {}".format(np.linalg.norm(A.dot(x_true))**2 / np.linalg.norm(v)**2))
f_star = f(x_true) + gamma*np.linalg.norm(x_true, 1)









    



Number of nonzeros in x_true = 107
Gamma = 0.3637573290728589
S2N ratio = 8.83535922792507



In [9]:

    
num_iter = 500
# x0 = np.zeros(n)
x0 = np.random.rand(n)



In [10]:

    
x, conv = prox_alg(f, f_grad, g_prox, x0, num_iter, fix_lam=None)
print("Number nonzeros in x* = {}".format(np.sum(x != 0)))
print("f* = {}".format(f(x) + gamma * np.linalg.norm(x, 1)))
x_acc, conv_acc = prox_alg(f, f_grad, g_prox, x0, num_iter, fix_lam=None, accel=True)
print("Number nonzeros in x* = {}".format(np.sum(x_acc != 0)))
print("f* = {}".format(f(x_acc) + gamma * np.linalg.norm(x_acc, 1)))









    



Number nonzeros in x* = 2023
f* = 159.51745043436227
Number nonzeros in x* = 37
f* = 13.13430160141383



In [11]:

    
subgrad_m = methods.fo.SubgradientMethod(f, f_subgrad, ss.ConstantStepSize(1e-3))
x_subgrad = subgrad_m.solve(x0=x0, max_iter=num_iter)
print("f^* = {}".format(f(x_subgrad)))









    



f^* = 9.447900613204844



In [12]:

    
plt.figure(figsize=(10, 8))
fontsize=25
plt.plot([f(x) + gamma * np.linalg.norm(x, 1) for x in conv], label="ISTA")
plt.plot([f(x) + gamma * np.linalg.norm(x, 1) for x in conv_acc], label="FISTA")
plt.plot([f(x) + gamma * np.linalg.norm(x, 1) for x in subgrad_m.get_convergence()], label="Subgradient method")
plt.legend(fontsize=fontsize)
plt.xlabel(r"Number of iterations, $k$", fontsize=fontsize)
plt.ylabel(r"Lasso objective, $h(x_k) + g(x_k)$", fontsize=fontsize)
plt.xticks(fontsize=fontsize)
_ = plt.yticks(fontsize=fontsize)
plt.yscale("log")

Matrix decomposition

\begin{align*} & \min \phi_1(X_1) + \gamma_2 \phi_2(X_2) + \ldots + \gamma_N \phi_N(X_N) \\ \text{s.t. } & X_1 + \ldots + X_N = A \end{align*}

Decompose matrix in the sum of well-structured matrices
Robust PCA: $A = L + S$, where
- $L$ is low-rank
- $S$ is sparse

Good presentation about robust PCA is here

Some penalties and structures being induced by them

Squared Frobenius norm gives small entries of $X$

$$ \phi(X) = \sum_{i,j} X^2_{ij} $$

Entrywise $\ell_1$ norm leads to sparsity of $X$

$$ \phi(X) = \sum_{i,j} |X_{ij}| $$

Nuclear norm gives low-rank matrix $X$

$$ \phi(X) = \|X\|_* = \sum_{i=1}^n \sigma_i(X) $$



In [13]:

    
import copy

def prox_alg_mat(f, f_grad, g_prox, X0, num_iter, beta=0.5, fix_lam=None, accel=False):
    conv = [X0]
    X = copy.copy(X0)
    if accel:
        t_prev = 1
        t_next = (1 + np.sqrt(1 + 4 * t_prev**2)) / 2.
    if fix_lam is None:
        lam = 1.
    Z = [[] for X in X0]
    for i in range(num_iter):
        if accel and i > 0:
            for i in range(len(X)):
                X[i] = X[i] + (t_prev - 1) / t_next * (X[i] - conv[-2][i])
            t_prev = t_next
            t_next = (1 + np.sqrt(1 + 4 * t_prev**2)) / 2.
        if fix_lam is None:
            current_f = f(X)
            current_grad = []
            for grad in f_grad:
                current_grad.append(grad(X))
            while True:
                for i, g in enumerate(g_prox):
                    Z[i] = g(X[i] - lam * current_grad[i], lam)
                lin_term = 0
                for i in range(len(Z)):
                    lin_term += np.trace(current_grad[i].T.dot(Z[i] - X[i]))
                quad_term = 0
                for i in range(len(Z)):
                    quad_term += np.linalg.norm(Z[i] - X[i])**2
                if f(Z) <= current_f + lin_term + quad_term / (2 * lam):
                    break
                else:
                    lam = lam * beta
        else:
            for i, g in enumerate(g_prox):
                Z[i] = g(X[i] - fix_lam * f_grad[i](X), fix_lam)
        X = copy.copy(Z)
        conv.append(X)
    return X, conv



In [18]:

    
m = 100
n = 100
rank = 4
L = np.random.randn(m, rank).dot(np.random.randn(rank, n))
S = 20 * np.random.rand(m * n) - 10
S[np.random.rand(m*n) < 0.9] = 0
print(np.sum(S != 0))
S = S.reshape(m, n)
V = 0.05 * np.random.randn(m, n)
A = L + S + V
max_gamma_3 = np.linalg.norm(A, ord="nuc")
gamma_3 = 0.01 * max_gamma_3
print(gamma_3)
max_gamma_2 = np.max(np.abs(A))
gamma_2 = 0.15 * max_gamma_2
print(gamma_2)









    



1034
18.338963029174305
2.419384168646618



In [26]:

    
import cvxpy as cvx

L_c = cvx.Variable((m, n))
S_c = cvx.Variable((m, n))
problem = cvx.Problem(cvx.Minimize(0.5 * cvx.norm(A  - (L_c + S_c))**2 + gamma_2 * cvx.norm(S_c, 1) + 
                                  gamma_3 * cvx.norm(L_c, "nuc")))
problem.solve(solver="SCS", verbose=True, eps=1e-5, max_iters=3000, use_indirect=True)
L_cvx = np.array(L_c.value)
S_cvx = np.array(S_c.value)
print("CVXPy optimal value = {}".format(problem.value))
print("Rank of lowrank term = {}".format(np.linalg.matrix_rank(L_cvx)))









    



----------------------------------------------------------------------------
	SCS v1.2.6 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012-2016
----------------------------------------------------------------------------
Lin-sys: sparse-indirect, nnz in A = 160503, CG tol ~ 1/iter^(2.00)
eps = 1.00e-05, alpha = 1.50, max_iters = 3000, normalize = 1, scale = 1.00
Variables n = 70203, constraints m = 100303
Cones:	primal zero / dual free vars: 40000
	linear vars: 20100
	soc vars: 3, soc blks: 1
	sd vars: 40200, sd blks: 2
Setup time: 9.61e-03s
----------------------------------------------------------------------------
 Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0|      inf       inf       nan      -inf       inf       inf  4.46e-02 
   100| 2.24e-03  1.59e-03  1.10e-03  5.99e+02  5.97e+02  1.68e-12  2.04e+00 
   200| 1.65e-03  1.90e-03  2.34e-03  5.58e+02  5.61e+02  4.98e-12  4.23e+00 
   300| 4.48e-04  1.08e-03  4.07e-03  5.48e+02  5.53e+02  1.69e-12  6.50e+00 
   400| 2.59e-04  6.65e-04  2.09e-03  5.50e+02  5.53e+02  1.72e-12  8.42e+00 
   500| 2.21e-04  4.44e-04  1.23e-03  5.49e+02  5.50e+02  1.73e-12  1.04e+01 
   600| 1.56e-04  2.92e-04  8.60e-04  5.48e+02  5.49e+02  1.74e-12  1.23e+01 
   700| 9.68e-05  1.96e-04  6.03e-04  5.47e+02  5.48e+02  1.75e-12  1.46e+01 
   800| 6.56e-05  1.37e-04  4.46e-04  5.47e+02  5.47e+02  1.75e-12  1.68e+01 
   900| 4.54e-05  9.81e-05  3.29e-04  5.47e+02  5.47e+02  1.76e-12  1.88e+01 
  1000| 3.24e-05  7.12e-05  2.41e-04  5.47e+02  5.47e+02  1.76e-12  2.07e+01 
  1100| 2.37e-05  5.25e-05  1.80e-04  5.47e+02  5.47e+02  5.28e-12  2.26e+01 
  1200| 1.74e-05  3.99e-05  1.40e-04  5.47e+02  5.47e+02  5.29e-12  2.48e+01 
  1300| 1.36e-05  3.15e-05  1.14e-04  5.46e+02  5.47e+02  1.76e-12  2.68e+01 
  1400| 1.04e-05  2.47e-05  8.95e-05  5.46e+02  5.47e+02  1.76e-12  2.86e+01 
  1500| 9.37e-06  2.03e-05  7.41e-05  5.46e+02  5.47e+02  1.76e-12  3.04e+01 
  1600| 6.63e-06  1.53e-05  5.44e-05  5.46e+02  5.46e+02  1.77e-12  3.22e+01 
  1700| 1.72e-05  2.19e-05  4.23e-05  5.46e+02  5.46e+02  1.77e-12  3.43e+01 
  1800| 1.33e-05  3.83e-05  2.99e-05  5.46e+02  5.46e+02  1.77e-12  3.63e+01 
  1900| 2.48e-04  1.03e-04  2.70e-04  5.47e+02  5.47e+02  5.30e-12  3.82e+01 
  2000| 1.83e-05  1.14e-05  9.56e-06  5.46e+02  5.46e+02  1.77e-12  4.02e+01 
  2100| 2.82e-05  3.18e-05  2.27e-05  5.46e+02  5.46e+02  1.77e-12  4.20e+01 
  2200| 3.69e-05  2.51e-05  1.17e-04  5.46e+02  5.46e+02  1.77e-12  4.41e+01 
  2300| 2.25e-02  2.50e-02  5.29e-04  5.46e+02  5.45e+02  1.77e-12  4.65e+01 
  2400| 5.61e-05  7.96e-05  5.83e-05  5.46e+02  5.46e+02  1.77e-12  4.84e+01 
  2500| 1.18e-05  1.49e-05  1.79e-07  5.46e+02  5.46e+02  5.30e-12  5.02e+01 
  2600| 2.32e-04  2.22e-04  2.44e-05  5.46e+02  5.46e+02  1.77e-12  5.20e+01 
  2700| 6.80e-05  1.02e-04  7.43e-06  5.46e+02  5.46e+02  1.77e-12  5.41e+01 
  2800| 3.90e-05  5.48e-05  5.89e-06  5.46e+02  5.46e+02  1.77e-12  5.62e+01 
  2900| 2.96e-05  1.57e-05  8.26e-05  5.46e+02  5.46e+02  1.77e-12  5.81e+01 
  3000| 2.00e-05  2.73e-05  7.07e-06  5.46e+02  5.46e+02  1.77e-12  5.98e+01 
----------------------------------------------------------------------------
Status: Solved/Inaccurate
Hit max_iters, solution may be inaccurate
Timing: Solve time: 5.98e+01s
	Lin-sys: avg # CG iterations: 7.93, avg solve time: 1.07e-02s
	Cones: avg projection time: 7.41e-03s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 7.2749e-05, dist(y, K*) = 2.4021e-07, s'y/|s||y| = 2.3852e-11
|Ax + s - b|_2 / (1 + |b|_2) = 2.0040e-05
|A'y + c|_2 / (1 + |c|_2) = 2.7338e-05
|c'x + b'y| / (1 + |c'x| + |b'y|) = 7.0721e-06
----------------------------------------------------------------------------
c'x = 546.3550, -b'y = 546.3473
============================================================================
CVXPy optimal value = 546.354989224373
Rank of lowrank term = 100



In [22]:

    
def f(X):
    Y = np.zeros_like(X[0])
    for y in X:
        Y = Y + y
    return np.linalg.norm(A - Y, ord="fro")**2 * 0.5

def grad_f_L(X):
    return X[0] + X[1] - A

def grad_f_S(X):
    return X[0] + X[1] - A

f_grad = [grad_f_S, grad_f_L]

def entry_l1_prox(X, lam):
    return np.sign(X) * np.maximum(np.abs(X) - gamma_2 * lam, 0) 

def nuclear_norm_prox(X, lam):
    U, sigma, V = np.linalg.svd(X, full_matrices=False)
    return U.dot(np.diag(np.maximum(sigma - lam * gamma_3, 0))).dot(V)

g_prox = [entry_l1_prox, nuclear_norm_prox]



In [23]:

    
# X0 = [np.random.randn(m, n), np.random.randn(m, n)]
X0 = [np.zeros((m, n)), np.zeros((m, n))]
num_iter = 600



In [24]:

    
step = 5e-3
X_pm, conv = prox_alg_mat(f, f_grad, g_prox, X0, num_iter, beta=0.5, fix_lam=step, accel=False)
print("f* = {}".format(f(X_pm) + gamma_2 * np.sum(np.abs(X_pm[0])) + gamma_3 * np.linalg.norm(X_pm[1], "nuc")))
print("Rank of the low-rank term = {}".format(np.linalg.matrix_rank(X_pm[1])))
print("Number of nnz in sparse term = {}".format(np.sum(X_pm[0] != 0)))
X_pm_acc, conv_acc = prox_alg_mat(f, f_grad, g_prox, X0, num_iter, beta=0.5, fix_lam=step, accel=True)
print("f* = {}".format(f(X_pm_acc) + gamma_2 * np.sum(np.abs(X_pm_acc[0])) + 
                       gamma_3 * np.linalg.norm(X_pm_acc[1], "nuc")))
print("Rank of the low-rank term = {}".format(np.linalg.matrix_rank(X_pm_acc[1])))
print("Number of nnz in sparse term = {}".format(np.sum(X_pm_acc[0] != 0)))









    



f* = 15617.653782762085
Rank of the low-rank term = 7
Number of nnz in sparse term = 840
f* = 15471.882342832263
Rank of the low-rank term = 4
Number of nnz in sparse term = 779



In [25]:

    
plt.figure(figsize=(10, 8))
fontsize=25
plt.plot([f(X) + gamma_2 * np.sum(np.abs(X[0])) + gamma_3 * np.linalg.norm(X[1], "nuc") for X in conv], label="PGM")
plt.plot([f(X) + gamma_2 * np.sum(np.abs(X[0])) + gamma_3 * np.linalg.norm(X[1], "nuc") for X in conv_acc], label="FPGM")
plt.legend(fontsize=fontsize)
plt.xlabel(r"Number of iterations, $k$", fontsize=fontsize)
plt.ylabel(r"Objective", fontsize=fontsize)
plt.xticks(fontsize=fontsize)
_ = plt.yticks(fontsize=fontsize)
plt.yscale("log")

Application in video processing

Stacked matrix with frames
Decompose it as sparse + low-rank term
Sparse term consists of vectorized foreground part of frames
Low-rank term consists of vectorized background part of frames

Further directions

Non-euclidean distance used as penalty, i.e. Bregman divergence and Mirror descent method
Non-convex optimization
Parallel implementations and distributed setting
Use different discretization methods from ODE-related science to get new fast optimization methods, see this paper

References

Proximal algorithms and related sources by N. Parikh and S. Boyd
Gradient-based algorithms with applications to signal recovery problems by A. Beck and M. Teboulle
Proximal Backpropagation by T Frerix et al (ICLR 2018)
Stochastic Proximal Gradient Descent with Acceleration Techniques by A. Nitanda (NIPS 2014)

Summary

Proximal method
Proximal map and its property
Proximal gradient method and its acceleration
Lasso application