Exercise: Optimization


We will implement various optimization algorithms and examine their performance on a number of tasks.

  1. First-order, smooth optimization using gradient descent
    • Implement a basic gradient descent solver
    • Implement gradient descent with Armijo backtracking
  2. Smooth One-class SVM
    • Implement the hinge and Huber loss functions
    • Implement the objective and derivative of the smooth one-class SVM
    • Use check_grad to verify the implementation
  3. First-order, non-smooth optimization using sub-gradient descent
    • Implement the objective and derivative of the $\ell_p$-norm regularized one-class SVM
  4. Utilizing Available QP Solver Packages: CVXOPT
    • Use cvxopt's qp solver to solve the primal one-class SVM optimization problem
  5. Utilizing Available Solver Packages: SciPy's Optimization Suite
    • Apply scipy's minimize function to your implementation of the objective function of the smooth one-class SVM

In [ ]:
from functools import partial
from scipy.optimize import check_grad, minimize

import numpy as np
import cvxopt as cvx

import matplotlib.pyplot as plt
%matplotlib inline

1. First-order, smooth optimization using gradient descent

In this first part, we want to use several variants of gradient descent for continuous, smooth optimization.

A well-known continuous, convex, and smooth method is $\ell_2$-norm regularized logistic regression, which has the following objective function:

$f(w) = \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^n \log(1+\exp(-y_i\langle w, x_i \rangle))$

In order to apply gradient descent, we will further need the first derivative:

$f'(w) = \lambda w + \sum_{i=1}^n \frac{-y_i}{1+\exp(y_i\langle w, x_i \rangle)}x_i$.


In [ ]:
def fun_l2_logistic_regression(w, X, y, param):
    # objective: param/2 * ||w||^2 + sum_i log(1 + exp(-y_i <w, x_i>))
    w = w.reshape(w.size, 1)
    t1 = 1. + np.exp(-y * w.T.dot(X).T)
    f = param/2.*w.T.dot(w) + np.sum(np.log(t1))
    return f[0, 0]

def grad_l2_logistic_regression(w, X, y, param):
    # gradient: param*w + sum_i -y_i / (1 + exp(y_i <w, x_i>)) * x_i
    w = w.reshape(w.size, 1)
    t2 = 1. + np.exp(y * w.T.dot(X).T)
    grad = param*w + (-y/t2).T.dot(X.T).T
    return grad.ravel()

Implement a basic gradient descent solver. Print the iteration number, the objective value, and the stopping criterion for each iteration.


In [ ]:
def findMin0(fun, grad, w, alpha, max_evals=10, eps=1e-2, verbosity=1):
    # TODO 
    return w
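
A possible reference sketch (yours may differ): plain gradient descent with a fixed step size $\alpha$ that stops once the gradient norm falls below eps. The name findMin0_sketch is only used so it does not clash with your own implementation.


In [ ]:
def findMin0_sketch(fun, grad, w, alpha, max_evals=10, eps=1e-2, verbosity=1):
    for evals in range(max_evals):
        g = grad(w)
        gnorm = np.linalg.norm(g)
        if verbosity > 0:
            # iteration, objective value, stopping criterion
            print('{0} {1:5.5f} {2:5.5f}'.format(evals, fun(w), gnorm))
        # stopping condition: the gradient (almost) vanishes
        if gnorm < eps:
            break
        # fixed-step gradient step
        w = w - alpha*g
    return w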

OK, let us generate a small data set to try out our optimization schemes.


In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 100)
w = np.random.randn(10, 1)
reg_y = w.T.dot(X)
median = np.median(reg_y)
y = -np.ones((reg_y.size, 1), dtype=int)
y[reg_y.ravel() >= median] = +1

Let's have a look at the most basic gradient descent method we can think of and start playing with the step size $\alpha$. Try a few values, e.g.

  • $\alpha=1.0$
  • $\alpha=10^{-6}$
  • $\alpha=0.001$

What do you notice?


In [ ]:
fun = partial(fun_l2_logistic_regression, X=X, y=y, param=1.)
grad = partial(grad_l2_logistic_regression, X=X, y=y, param=1.)

print(check_grad(fun, grad, 0.0*np.random.randn(10)))

wstar = findMin0(fun, grad, 0.0*np.random.randn(10), 0.001, max_evals=1000, eps=1e-8, verbosity=0)

We do not want to tweak the $\alpha$'s for every single optimization problem. This is where line search steps in.

Implement a basic gradient descent solver with Armijo back-tracking line-search.


In [ ]:
def findMinBT(fun, grad, w, alpha, gamma, max_evals=10, eps=1e-2, verbosity=1):
    # TODO
    return w
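
Again, one possible sketch. It assumes the Armijo condition is checked as $f(w - \alpha g) \leq f(w) - \gamma \alpha \|g\|^2$ and that $\alpha$ is halved until the condition holds:


In [ ]:
def findMinBT_sketch(fun, grad, w, alpha, gamma, max_evals=10, eps=1e-2, verbosity=1):
    for evals in range(max_evals):
        g = grad(w)
        obj = fun(w)
        gnorm = np.linalg.norm(g)
        if verbosity > 0:
            print('{0} {1:5.5f} {2:5.5f} {3:5.5f}'.format(evals, alpha, obj, gnorm))
        if gnorm < eps:
            break
        # backtracking: shrink alpha until the Armijo condition holds
        while fun(w - alpha*g) > obj - gamma*alpha*gnorm**2:
            alpha /= 2.
        w = w - alpha*g
        # be a bit more optimistic again in the next iteration
        alpha *= 2.
    return w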

In [ ]:
fun = partial(fun_l2_logistic_regression, X=X, y=y, param=1.)
grad = partial(grad_l2_logistic_regression, X=X, y=y, param=1.)

print(check_grad(fun, grad, 0.0*np.random.randn(10)))
wstar = findMinBT(fun, grad, 0.0*np.random.randn(10), 1., 0.0001, max_evals=100, verbosity=0)

More elaborate optimization methods (e.g. Newton's method) use second-order information to find better step directions and lengths. We will come back to this later.

2. Smooth One-class SVM

Since this is an anomaly detection workshop, we want to train some anomaly detectors. So, here is our one-class SVM primal problem again:

$\min_{w,\rho,\xi} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \xi_i$

subject to the following constraints:

$\xi_i \geq 0\;, \quad \langle w, x_i \rangle \geq \rho - \xi_i \; , \quad \forall \; i$

This optimization problem is unfortunately neither smooth nor unconstrained. So, let's change this.

  1. We get rid of the constraints by re-formulating them: $\xi_i \geq 0\;, \quad \langle w, x_i \rangle \geq \rho - \xi_i \;\Rightarrow\; \xi_i \geq 0\;, \quad \xi_i \geq \rho - \langle w, x_i \rangle$. Since we minimize the objective, the second inequality holds with equality at the optimum, so we can replace $\xi_i$ in the objective by $\rho - \langle w, x_i \rangle$. However, the first inequality states that whenever this expression is negative, $\xi_i$ stays at $0$. Both cases are covered by taking $\max(0, \rho - \langle w, x_i \rangle)$. Hence, we land at the following unconstrained problem:

    $\min_{w,\rho} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \max(0,\rho - \langle w, x_i \rangle)$,

    which can be written in terms of a general loss function $\ell$ as

    $\min_{w,\rho} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \ell(\rho - \langle w, x_i \rangle)$.

This is now unconstrained but still not smooth, as the max (which is, by the way, called the hinge loss) introduces non-smoothness into the problem, so gradient descent solvers cannot readily be applied. So, let's make it differentiable by approximation.


  2. Approximating the hinge loss by the differentiable Huber loss

    $\ell_{\Delta,\epsilon}(x) :=
     \left\{\begin{array}{lr}
     \Delta + x, & \text{for } x \geq \Delta + \epsilon\\
     \frac{(\Delta + \epsilon + x)^2}{4\epsilon}, & \text{for } \Delta - \epsilon\leq x\leq \Delta + \epsilon\\
     0, & \text{else}
     \end{array}\right.$
    
    

    ...and the corresponding derivative is:

    $\frac{\partial}{\partial x}\ell_{\Delta,\epsilon}(x) :=
     \left\{\begin{array}{lr}
     1, & \text{for } x \geq \Delta + \epsilon\\
     \frac{(\Delta + \epsilon + x)}{2\epsilon}, & \text{for } \Delta - \epsilon\leq x\leq \Delta + \epsilon\\
     0, & \text{else}
     \end{array}\right.$
    
    

    For our purposes, $\Delta=0.0$ and $\epsilon=0.5$ will suffice. (With $\Delta=0$, the quadratic piece meets the linear piece at $x=\epsilon$ and reaches $0$ at $x=-\epsilon$, so the approximation is continuously differentiable.)

(1) Implement the hinge loss $\ell(x) = \max(0,x)$

(2) Implement the Huber loss as defined above


In [ ]:
def hinge_loss(x):
    # TODO
    return 0
    
def huber_loss(x, delta, epsilon):
    # TODO
    return 0
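
For reference, here is one possible vectorized sketch using np.where; the names carry a _sketch suffix so they do not overwrite your own functions.


In [ ]:
def hinge_loss_sketch(x):
    # element-wise max(0, x)
    return np.maximum(0., x)

def huber_loss_sketch(x, delta, epsilon):
    x = np.asarray(x, dtype=np.float64)
    quad = (delta + epsilon + x)**2 / (4.*epsilon)
    # linear for x >= delta+epsilon, quadratic in between, zero below delta-epsilon
    return np.where(x >= delta + epsilon, delta + x,
                    np.where(x >= delta - epsilon, quad, 0.))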

In [ ]:
xs = np.linspace(-1, +1, 1000)

plt.plot(xs, hinge_loss(xs), '-r', linewidth=2.0)
plt.plot(xs, huber_loss(xs, 0., 0.5), '--b', linewidth=2.0)

plt.grid()
plt.tight_layout()

Implement the smooth one-class SVM objective and derivative as defined above.


In [ ]:
def fun_smooth_ocsvm(var, X, nu, delta, epsilon):
    # TODO
    return 0
    
def grad_smooth_ocsvm(var, X, nu, delta, epsilon):
    # TODO
    return 0
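
A possible sketch. It assumes that var stacks $\rho$ first and $w$ afterwards (so that xstar[1:] recovers $w$, as in the cell below), that X has shape (dims, n), and that huber_loss_sketch from above (or your own huber_loss) is available. The helper huber_loss_grad_sketch for the loss derivative is not part of the exercise skeleton.


In [ ]:
def huber_loss_grad_sketch(x, delta, epsilon):
    # derivative of the Huber loss defined above
    x = np.asarray(x, dtype=np.float64)
    quad = (delta + epsilon + x) / (2.*epsilon)
    return np.where(x >= delta + epsilon, 1.,
                    np.where(x >= delta - epsilon, quad, 0.))

def fun_smooth_ocsvm_sketch(var, X, nu, delta, epsilon):
    rho, w = var[0], var[1:]
    n = X.shape[1]
    margins = rho - w.dot(X)   # rho - <w, x_i> for all samples
    return 0.5*w.dot(w) - rho + np.sum(huber_loss_sketch(margins, delta, epsilon)) / (n*nu)

def grad_smooth_ocsvm_sketch(var, X, nu, delta, epsilon):
    rho, w = var[0], var[1:]
    n = X.shape[1]
    margins = rho - w.dot(X)
    dl = huber_loss_grad_sketch(margins, delta, epsilon)   # per-sample loss derivatives
    grad_rho = -1. + np.sum(dl) / (n*nu)
    grad_w = w - X.dot(dl) / (n*nu)
    return np.concatenate(([grad_rho], grad_w))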

In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 100)

In [ ]:
fun = partial(fun_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)
grad = partial(grad_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)

# First, check gradient vs numerical gradient.
# This should give very small results.
print(check_grad(fun, grad, 0.1*np.random.randn(10+1)))

xstar = findMinBT(fun, grad, 0.0*np.random.randn(10+1), 1., 0.0001, max_evals=1000, eps=1e-4, verbosity=0)
wstar = xstar[1:]

print(wstar)
print(np.mean(X, axis=1))
print(np.linalg.norm(wstar - np.mean(X, axis=1)))

3. First-order, non-smooth optimization using sub-gradient descent

Unfortunately, many interesting methods contain a non-smooth part in their objective. Examples include support vector machines (SVM), one-class support vector machines (OCSVM), and support vector data descriptions (SVDD).

Here, we are going to implement a version of the primal one-class SVM with an $\ell_p$-norm regularizer. This allows us to control the sparsity of the solution vector:

$\min_{w,\rho} \frac{1}{2}\|w\|_p^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \max(0,\rho - \langle w, x_i \rangle)$.

The resulting optimization problem is unconstrained but non-smooth. We will use a subgradient descent solver for this problem.


In [ ]:
def findMinSG(fun, grad, x0, rate, max_evals=1000, eps=1e-2, step_method=1, verbosity=1):
    dims = x0.size
    x = x0
    best_x = x
    best_obj = np.float64(1e20)
    obj_bak = -1e10
    evals = 0
    is_converged = False
    while not is_converged and evals < max_evals:
        obj = fun(x)
        # this is a subgradient method, hence we need to keep the best solution so far
        if best_obj >= obj:
            best_x = x
            best_obj = obj

        # stop if progress is too slow
        diff = np.abs(obj - obj_bak)
        if diff < eps:
            is_converged = True
            continue
        obj_bak = obj
        
        # gradient step for threshold
        g = grad(x)
        if step_method == 1:
            # constant step
            alpha = rate
        elif step_method == 2:
            # non-summable diminishing step size
            alpha = rate / np.sqrt(evals + 1.)
        else:
            # constant step length
            alpha = rate / np.linalg.norm(g)
            
        if verbosity > 0:
            print('{0} {1:5.5f} {2:5.5f} {3:5.5f}'.format(evals, alpha, obj, diff))

        # update
        x = x - alpha*g
        evals += 1

    print('{0} {1:5.5f} {2:5.5f} {3:5.5f}'.format(evals, alpha, obj, diff))
    return best_x

In [ ]:
def fun_lp_norm_ocsvm(var, X, p, nu):
    # TODO
    return 0
    
def grad_lp_norm_ocsvm(var, X, p, nu):
    # TODO
    return var
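
Again a possible sketch (same variable layout as before: var = [rho, w], X of shape (dims, n)). For the regularizer, note that $\frac{\partial}{\partial w_j}\frac{1}{2}\|w\|_p^2 = \|w\|_p^{2-p}\,\mathrm{sign}(w_j)\,|w_j|^{p-1}$, and for the hinge part we simply pick the subgradient that is zero at the kink:


In [ ]:
def fun_lp_norm_ocsvm_sketch(var, X, p, nu):
    rho, w = var[0], var[1:]
    n = X.shape[1]
    margins = rho - w.dot(X)
    return 0.5*np.linalg.norm(w, ord=p)**2 - rho + np.sum(np.maximum(0., margins)) / (n*nu)

def grad_lp_norm_ocsvm_sketch(var, X, p, nu):
    rho, w = var[0], var[1:]
    n = X.shape[1]
    margins = rho - w.dot(X)
    active = (margins > 0.).astype(np.float64)   # chosen subgradient of the hinge
    # (sub-)gradient of 0.5*||w||_p^2; well-defined as long as w is not the zero vector
    norm_w = np.linalg.norm(w, ord=p)
    grad_reg = norm_w**(2.-p) * np.sign(w) * np.abs(w)**(p-1.)
    grad_rho = -1. + np.sum(active) / (n*nu)
    grad_w = grad_reg - X.dot(active) / (n*nu)
    return np.concatenate(([grad_rho], grad_w))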

In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 1000)

In [ ]:
fun = partial(fun_lp_norm_ocsvm, X=X, p=2.0, nu=1.0)
grad = partial(grad_lp_norm_ocsvm, X=X, p=2.0, nu=1.0)

xstar = findMinSG(fun, grad, np.random.randn(10+1), 0.01, max_evals=2000, eps=1e-2, step_method=1, verbosity=1)
wstar = xstar[1:]

print(wstar)
print(np.mean(X, axis=1))
print(np.linalg.norm(wstar - np.mean(X, axis=1)))

Let's have a look at how the sparsity is controlled by varying $p$.


In [ ]:
np.random.seed(42)
xs = np.array([1.0, 1.5, 2.0, 4.0, 100.0])
sparsity = np.zeros((xs.size, X.shape[0]))
for i in range(xs.size):
    fun = partial(fun_lp_norm_ocsvm, X=X, p=xs[i], nu=1.0)
    grad = partial(grad_lp_norm_ocsvm, X=X, p=xs[i], nu=1.0)

    xstar = findMinSG(fun, grad, np.random.randn(10+1), 0.001, max_evals=5000, eps=1e-3, step_method=3, verbosity=0)
    wstar = xstar[1:]

    wstar = np.abs(wstar)
    wstar /= np.max(wstar)
    sparsity[i, :] = wstar

    plt.subplot(1, xs.size, i+1)
    plt.bar(np.arange(X.shape[0]), sparsity[i, :])
    plt.title('p={0:1.2f}'.format(xs[i]))
    plt.grid()
    
plt.tight_layout()

4. Utilizing Available QP Solver Packages: CVXOPT

There are very good general-purpose solvers available for certain types of optimization problems. The most important ones are CPLEX, MOSEK, and CVXOPT, where the latter is free and contains interfaces to the commercial solvers (CPLEX and MOSEK).

Back to the one-class SVM primal problem:

$\min_{w,\rho,\xi} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \xi_i$

subject to the following constraints:

$\xi_i \geq 0\;, \quad \langle w, x_i \rangle \geq \rho - \xi_i \; , \quad \forall \; i$

Use cvxopt's qp method to solve this problem (cvxopt.solvers.qp(P, q, G, h)). Hence, the above problem needs to be rewritten in the standard form:

$\min_x \frac{1}{2}x^T P x + q^T x$ subject to $Gx \leq h$ and $Ax=b$.


In [ ]:
def calculate_primal_qp_solution(X, nu):
    # Solution vector 'x' is a concatenation of w \in R^dims, xi \in R^n, rho \in R
    # and hence has a dimensionality of dims+n+1.
    
    # TODO
    
    # solve qp
    sol = cvx.solvers.qp(cvx.matrix(P), cvx.matrix(q), cvx.matrix(G), cvx.matrix(h))
    return np.array(sol['x'])[:d].ravel(), np.array(sol['x'])[d:d+n].ravel(), np.array(sol['x'])[-1]
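
One way to fill in the TODO (a sketch): build dense numpy matrices with the variable ordering x = [w, xi, rho] from the comment above and hand them to cvxopt. If the solver complains about a singular KKT matrix, adding a tiny ridge (e.g. 1e-8 on the diagonal of P) usually helps.


In [ ]:
def calculate_primal_qp_solution_sketch(X, nu):
    d, n = X.shape
    m = d + n + 1   # x = [w (d), xi (n), rho (1)]
    # quadratic term: only the w-block contributes 0.5*||w||^2
    P = np.zeros((m, m))
    P[:d, :d] = np.eye(d)
    # linear term: + 1/(n*nu) * sum_i xi_i  and  - rho
    q = np.zeros(m)
    q[d:d+n] = 1. / (n*nu)
    q[-1] = -1.
    # inequality constraints G x <= h:
    #   -xi_i <= 0                      (i.e. xi_i >= 0)
    #   -<w, x_i> - xi_i + rho <= 0     (i.e. <w, x_i> >= rho - xi_i)
    G = np.zeros((2*n, m))
    G[:n, d:d+n] = -np.eye(n)
    G[n:, :d] = -X.T
    G[n:, d:d+n] = -np.eye(n)
    G[n:, -1] = 1.
    h = np.zeros(2*n)
    sol = cvx.solvers.qp(cvx.matrix(P), cvx.matrix(q), cvx.matrix(G), cvx.matrix(h))
    x = np.array(sol['x']).ravel()
    return x[:d], x[d:d+n], x[-1]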

In [ ]:
wstar, xistar, rhostar = calculate_primal_qp_solution(X, 1.)

print('Optimized solution: ', wstar)
print('Truth: ', np.mean(X, axis=1))
print('Difference: ', np.linalg.norm(wstar - np.mean(X, axis=1)))

As you might notice, coding the derivatives is not trivial and takes most of the time. There are some methods built into SciPy that help you with that and also optimize using more elaborate techniques such as L-BFGS (a limited-memory quasi-Newton method). Here, let's recycle some of our functions...

5. Utilizing Available Solver Packages: SciPy's Optimization Suite

Here is a link to SciPy's minimize function, which provides access to many solvers for smooth (un-)constrained optimization problems:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html

We will recycle our smooth one-class SVM objective function.

Use L-BFGS-B as optimizer.


In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 100)

fun = partial(fun_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)


# TODO: call scipy's minimize (method='L-BFGS-B') here
# res: result as returned by scipy
xstar = res.x
wstar = xstar[1:]

print(wstar)
print(np.mean(X, axis=1))
print(np.linalg.norm(wstar - np.mean(X, axis=1)))
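
One possible way to obtain res in the cell above (a sketch): pass the analytic gradient via jac; if jac is omitted, scipy approximates the gradient numerically, which also works but is slower.


In [ ]:
grad = partial(grad_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)
res = minimize(fun, 0.0*np.random.randn(10+1), jac=grad, method='L-BFGS-B',
               options={'maxiter': 1000})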

In [ ]: