We will implement several optimization algorithms and examine their performance on a variety of tasks.
In [ ]:
from functools import partial
from scipy.optimize import check_grad, minimize
import numpy as np
import cvxopt as cvx
import matplotlib.pyplot as plt
%matplotlib inline
A well-known continuous, convex, and smooth problem is $\ell_2$-norm regularized logistic regression, which has the following objective function:
$f(w) = \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^n \log(1+\exp(-y_i\langle w, x_i \rangle))$
In order to apply gradient descent, we will further need the first derivative:
$f'(w) = \lambda w + \sum_{i=1}^n \frac{-y_i}{1+\exp(y_i\langle w, x_i \rangle)}x_i$.
In [ ]:
def fun_l2_logistic_regression(w, X, y, param):
    w = w.reshape(w.size, 1)
    t1 = 1. + np.exp(-y * w.T.dot(X).T)
    f = param/2.*w.T.dot(w) + np.sum(np.log(t1))
    return f[0, 0]

def grad_l2_logistic_regression(w, X, y, param):
    w = w.reshape(w.size, 1)
    t2 = 1. + np.exp(y * w.T.dot(X).T)
    grad = param*w + (-y/t2).T.dot(X.T).T
    return grad.ravel()
Implement a basic gradient descent solver. For each iteration, plot the iteration number, the objective value, and the value of the stopping criterion.
In [ ]:
def findMin0(fun, grad, w, alpha, max_evals=10, eps=1e-2, verbosity=1):
    # TODO
    return w
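If you want something to compare your solution against, here is a minimal sketch of what findMin0 could look like, assuming a constant step-size $\alpha$ and the gradient norm as stopping criterion (the exercise leaves both choices open):
In [ ]:
def findMin0(fun, grad, w, alpha, max_evals=10, eps=1e-2, verbosity=1):
    # Plain gradient descent with a constant step-size alpha.
    # Stops when the gradient norm drops below eps or max_evals is reached.
    evals = 0
    g = grad(w)
    while np.linalg.norm(g) > eps and evals < max_evals:
        w = w - alpha * g
        g = grad(w)
        evals += 1
        if verbosity > 0:
            print('{0} {1:5.5f} {2:5.5f}'.format(evals, fun(w), np.linalg.norm(g)))
    return w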
Ok, let us generate some small data set to try out our optimization schemes.
In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 100)
w = np.random.randn(10, 1)
reg_y = w.T.dot(X)
median = np.median(reg_y)
y = -np.ones((reg_y.size, 1), dtype=int)  # np.int is deprecated, use the builtin int
y[reg_y.ravel() >= median] = +1
Let us have a look at the most basic gradient descent method we can think of and start playing with the step-size $\alpha$. Try a few different values. What do you notice?
In [ ]:
fun = partial(fun_l2_logistic_regression, X=X, y=y, param=1.)
grad = partial(grad_l2_logistic_regression, X=X, y=y, param=1.)
print(check_grad(fun, grad, 0.0*np.random.randn(10)))
wstar = findMin0(fun, grad, 0.0*np.random.randn(10), 0.001, max_evals=1000, eps=1e-8, verbosity=0)
We do not want to tweak the $\alpha$'s for every single optimization problem. This is where line search steps in.
Implement a basic gradient descent solver with Armijo back-tracking line-search.
In [ ]:
def findMinBT(fun, grad, w, alpha, gamma, max_evals=10, eps=1e-2, verbosity=1):
    # TODO
    return w
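Again, a possible sketch, assuming the standard Armijo condition $f(w - \alpha g) \leq f(w) - \gamma \alpha \|g\|^2$ and halving of the step-size until it holds; resetting $\alpha$ in every iteration is also an assumption:
In [ ]:
def findMinBT(fun, grad, w, alpha, gamma, max_evals=10, eps=1e-2, verbosity=1):
    # Gradient descent with Armijo back-tracking line-search.
    # alpha is the initial step-size, gamma the sufficient-decrease parameter.
    evals = 0
    g = grad(w)
    while np.linalg.norm(g) > eps and evals < max_evals:
        step = alpha
        f_old = fun(w)
        # shrink the step until the Armijo condition holds
        while fun(w - step * g) > f_old - gamma * step * g.dot(g):
            step /= 2.
        w = w - step * g
        g = grad(w)
        evals += 1
        if verbosity > 0:
            print('{0} {1:5.5f} {2:5.5f} {3:5.5f}'.format(evals, step, fun(w), np.linalg.norm(g)))
    return w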
In [ ]:
fun = partial(fun_l2_logistic_regression, X=X, y=y, param=1.)
grad = partial(grad_l2_logistic_regression, X=X, y=y, param=1.)
print(check_grad(fun, grad, 0.0*np.random.randn(10)))
wstar = findMinBT(fun, grad, 0.0*np.random.randn(10), 1., 0.0001, max_evals=100, verbosity=0)
More elaborate optimization methods (e.g. Newton's method) use second-order information to find better descent directions and step-lengths. We will come back to this later.
Since this is an anomaly detection workshop, we want to train some anomaly detectors. So, here is our one-class SVM primal problem again:
$\min_{w,\rho,\xi} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \xi_i$
subject to the following constraints:
$\xi_i \geq 0\;, \quad \langle w, x_i \rangle \geq \rho - \xi_i \; , \quad \forall \; i$
This optimization problem is unfortunately neither smooth nor unconstrained. So, let's change that.
We will get rid of the constraints by re-formulating them: $\xi_i \geq 0\;, \quad \langle w, x_i \rangle \geq \rho - \xi_i \; \Rightarrow \; \xi_i \geq 0\;, \quad \xi_i \geq \rho - \langle w, x_i \rangle$. Since we minimize the objective and it is increasing in each $\xi_i$, the second inequality holds with equality at the optimum, so we can replace $\xi_i$ in the objective by its right-hand side. The first inequality, $\xi_i \geq 0$, additionally requires that $\xi_i$ stays at $0$ whenever that right-hand side is negative. Both requirements are captured by $\max(0, \rho - \langle w, x_i \rangle)$. Hence, we arrive at the following unconstrained problem:
$\min_{w,\rho} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \max(0,\rho - \langle w, x_i \rangle)$,
which can be written in terms of a general loss function $\ell$ as
$\min_{w,\rho} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \ell(\rho - \langle w, x_i \rangle)$.
This is now unconstrained but still not smooth: the max (which is, by the way, called the hinge loss) introduces a kink into the problem, so gradient descent solvers cannot readily be applied. So, let's make it differentiable by approximation.
Approximating the hinge loss by the differentiable Huber loss
$\ell_{\Delta,\epsilon}(x) :=
\left\{\begin{array}{lr}
\Delta + x, & \text{for } x \geq \Delta + \epsilon\\
\frac{(\Delta + \epsilon + x)^2}{4\epsilon}, & \text{for } \Delta - \epsilon\leq x\leq \Delta + \epsilon\\
0, & \text{else}
\end{array}\right\}$
...and the corresponding derivative is (I hope):
$\frac{\partial}{\partial x}\ell_{\Delta,\epsilon}(x) :=
\left\{\begin{array}{lr}
1, & \text{for } x \geq \Delta + \epsilon\\
\frac{(\Delta + \epsilon + x)}{2\epsilon}, & \text{for } \Delta - \epsilon\leq x\leq \Delta + \epsilon\\
0, & \text{else}
\end{array}\right\}$
For our purposes, $\Delta=0.0$ and $\epsilon=0.5$ will suffice.
(1) Implement the hinge loss $\ell(x) = \max(0,x)$
(2) Implement the Huber loss as defined above
In [ ]:
def hinge_loss(x):
    # TODO
    return 0

def huber_loss(x, delta, epsilon):
    # TODO
    return 0
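A possible vectorized sketch of the two losses (element-wise over a NumPy array), following the definitions above:
In [ ]:
def hinge_loss(x):
    # element-wise hinge loss max(0, x)
    return np.maximum(0., x)

def huber_loss(x, delta, epsilon):
    # element-wise Huber approximation of the hinge loss as defined above
    l = np.zeros_like(x, dtype=np.float64)
    lin = x >= delta + epsilon
    quad = np.logical_and(x >= delta - epsilon, x < delta + epsilon)
    l[lin] = delta + x[lin]
    l[quad] = (delta + epsilon + x[quad])**2 / (4. * epsilon)
    return l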
In [ ]:
xs = np.linspace(-1, +1, 1000)
plt.plot(xs, hinge_loss(xs), '-r', linewidth=2.0)
plt.plot(xs, huber_loss(xs, 0., 0.5), '--b', linewidth=2.0)
plt.grid()
plt.tight_layout()
Implement the smooth one-class SVM objective and its derivative as defined above.
In [ ]:
def fun_smooth_ocsvm(var, X, nu, delta, epsilon):
    # TODO
    return 0

def grad_smooth_ocsvm(var, X, nu, delta, epsilon):
    # TODO
    return 0
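A possible sketch, assuming the variable vector is stacked as var = [rho, w] (which matches wstar = xstar[1:] below) and re-using huber_loss from above; huber_loss_grad is a hypothetical helper implementing the derivative given earlier:
In [ ]:
def huber_loss_grad(x, delta, epsilon):
    # element-wise derivative of the Huber loss as defined above
    g = np.zeros_like(x, dtype=np.float64)
    lin = x >= delta + epsilon
    quad = np.logical_and(x >= delta - epsilon, x < delta + epsilon)
    g[lin] = 1.
    g[quad] = (delta + epsilon + x[quad]) / (2. * epsilon)
    return g

def fun_smooth_ocsvm(var, X, nu, delta, epsilon):
    # var = [rho, w]; objective 0.5*||w||^2 - rho + 1/(n*nu) * sum_i huber(rho - <w, x_i>)
    rho, w = var[0], var[1:]
    n = X.shape[1]
    margin = rho - w.dot(X)
    return 0.5 * w.dot(w) - rho + np.sum(huber_loss(margin, delta, epsilon)) / (n * nu)

def grad_smooth_ocsvm(var, X, nu, delta, epsilon):
    # gradient w.r.t. [rho, w]
    rho, w = var[0], var[1:]
    n = X.shape[1]
    lgrad = huber_loss_grad(rho - w.dot(X), delta, epsilon)
    grad_rho = -1. + np.sum(lgrad) / (n * nu)
    grad_w = w - X.dot(lgrad) / (n * nu)
    return np.concatenate(([grad_rho], grad_w))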
In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 100)
In [ ]:
fun = partial(fun_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)
grad = partial(grad_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)
# First, check gradient vs numerical gradient.
# This should give very small results.
print(check_grad(fun, grad, 0.1*np.random.randn(10+1)))
xstar = findMinBT(fun, grad, 0.0*np.random.randn(10+1), 1., 0.0001, max_evals=1000, eps=1e-4, verbosity=0)
wstar = xstar[1:]
print(wstar)
print(np.mean(X, axis=1))
print(np.linalg.norm(wstar - np.mean(X, axis=1)))
Here, we are going to implement a version of the primal one-class SVM with an $\ell_p$-norm regularizer. This will allow us to control the sparsity of the resulting solution vector:
$\min_{w,\rho} \frac{1}{2}\|w\|_p^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \max(0,\rho - \langle w, x_i \rangle).$
The resulting optimization problem is unconstrained but non-smooth. We will use a subgradient descent solver for this problem.
In [ ]:
def findMinSG(fun, grad, x0, rate, max_evals=1000, eps=1e-2, step_method=1, verbosity=1):
    x = x0
    best_x = x
    best_obj = np.float64(1e20)
    obj_bak = -1e10
    evals = 0
    alpha = rate
    is_converged = False
    while not is_converged and evals < max_evals:
        obj = fun(x)
        # this is a subgradient method, hence we need to store the best solution so far
        if best_obj >= obj:
            best_x = x
            best_obj = obj
        # stop, if progress is too slow
        diff = np.abs(obj - obj_bak)
        if diff < eps:
            is_converged = True
            continue
        obj_bak = obj
        # subgradient step
        g = grad(x)
        if step_method == 1:
            # constant step size
            alpha = rate
        elif step_method == 2:
            # non-summable diminishing step size
            alpha = rate / np.sqrt(float(evals + 1.))
        else:
            # constant step length
            alpha = rate / np.linalg.norm(g)
        if verbosity > 0:
            print('{0} {1:5.5f} {2:5.5f} {3:5.5f}'.format(evals, alpha, obj, diff))
        # update
        x = x - alpha*g
        evals += 1
    print('{0} {1:5.5f} {2:5.5f} {3:5.5f}'.format(evals, alpha, obj, diff))
    return best_x
In [ ]:
def fun_lp_norm_ocsvm(var, X, p, nu):
    # TODO
    return 0

def grad_lp_norm_ocsvm(var, X, p, nu):
    # TODO
    return var
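A possible sketch, again assuming var = [rho, w] and re-using hinge_loss from above; at the kink of the hinge we pick the subgradient $0$, and the regularizer is differentiated via $\frac{\partial}{\partial w_j}\frac{1}{2}\|w\|_p^2 = \mathrm{sign}(w_j)\,|w_j|^{p-1}\,\|w\|_p^{2-p}$:
In [ ]:
def fun_lp_norm_ocsvm(var, X, p, nu):
    # var = [rho, w]; objective 0.5*||w||_p^2 - rho + 1/(n*nu) * sum_i max(0, rho - <w, x_i>)
    rho, w = var[0], var[1:]
    n = X.shape[1]
    return 0.5 * np.linalg.norm(w, ord=p)**2 - rho + np.sum(hinge_loss(rho - w.dot(X))) / (n * nu)

def grad_lp_norm_ocsvm(var, X, p, nu):
    # a subgradient of the objective w.r.t. [rho, w]
    rho, w = var[0], var[1:]
    n = X.shape[1]
    norm_p = np.linalg.norm(w, ord=p)
    # derivative of 0.5*||w||_p^2
    reg_grad = np.sign(w) * np.abs(w)**(p - 1.) * norm_p**(2. - p)
    active = (rho - w.dot(X)) > 0.  # samples with positive slack
    grad_rho = -1. + np.sum(active) / (n * nu)
    grad_w = reg_grad - X.dot(active.astype(np.float64)) / (n * nu)
    return np.concatenate(([grad_rho], grad_w))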
In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 1000)
In [ ]:
fun = partial(fun_lp_norm_ocsvm, X=X, p=2.0, nu=1.0)
grad = partial(grad_lp_norm_ocsvm, X=X, p=2.0, nu=1.0)
xstar = findMinSG(fun, grad, np.random.randn(10+1), 0.01, max_evals=2000, eps=1e-2, step_method=1, verbosity=1)
wstar = xstar[1:]
print(wstar)
print(np.mean(X, axis=1))
print(np.linalg.norm(wstar - np.mean(X, axis=1)))
Let's have a look at how the sparsity is controlled by varying $p$.
In [ ]:
np.random.seed(42)
xs = np.array([1.0, 1.5, 2.0, 4.0, 100.0])
sparsity = np.zeros((xs.size, X.shape[0]))
for i in range(xs.size):
    fun = partial(fun_lp_norm_ocsvm, X=X, p=xs[i], nu=1.0)
    grad = partial(grad_lp_norm_ocsvm, X=X, p=xs[i], nu=1.0)
    xstar = findMinSG(fun, grad, np.random.randn(10+1), 0.001, max_evals=5000, eps=1e-3, step_method=3, verbosity=0)
    wstar = xstar[1:]
    wstar = np.abs(wstar)
    wstar /= np.max(wstar)
    sparsity[i, :] = wstar
    plt.subplot(1, xs.size, i+1)
    plt.bar(np.arange(X.shape[0]), sparsity[i, :])
    plt.title('p={0:1.2f}'.format(xs[i]))
    plt.grid()
plt.tight_layout()
Back again at the one-class SVM primal problem:
$\min_{w,\rho,\xi} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^n \xi_i$
subject to the following constraints:
$\xi_i \geq 0\;, \quad \langle w, x_i \rangle \geq \rho - \xi_i \; , \quad \forall \; i$
Use cvxopt's qp method, cvxopt.solvers.qp(P, q, G, h), to solve this problem. Hence, the above problem needs to be re-written in the standard form
$\min_x \frac{1}{2}x^T P x + q^T x$ subject to $Gx \leq h$ and $Ax=b$.
In [ ]:
def calculate_primal_qp_solution(X, nu):
    # The solution vector x is a concatenation of w \in R^dims, xi \in R^n, and rho \in R,
    # and hence has dimensionality dims+n+1.
    # TODO
    # solve the qp
    sol = cvx.solvers.qp(cvx.matrix(P), cvx.matrix(q), cvx.matrix(G), cvx.matrix(h))
    return np.array(sol['x'])[:d].ravel(), np.array(sol['x'])[d:d+n].ravel(), np.array(sol['x'])[-1]
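One possible way to fill in the TODO, assuming the ordering $x = [w, \xi, \rho]$ that the return statement expects:
In [ ]:
def calculate_primal_qp_solution(X, nu):
    # Solution vector x = [w (d), xi (n), rho (1)], dimensionality d+n+1.
    d, n = X.shape
    # quadratic term: only w enters via 0.5*||w||^2
    P = np.zeros((d + n + 1, d + n + 1))
    P[:d, :d] = np.eye(d)
    # linear term: +1/(n*nu) for each xi_i and -1 for rho
    q = np.zeros(d + n + 1)
    q[d:d+n] = 1. / (n * nu)
    q[-1] = -1.
    # inequality constraints G x <= h:
    #   -xi_i <= 0                      (xi_i >= 0)
    #   rho - xi_i - <w, x_i> <= 0      (<w, x_i> >= rho - xi_i)
    G = np.zeros((2*n, d + n + 1))
    G[:n, d:d+n] = -np.eye(n)
    G[n:, :d] = -X.T
    G[n:, d:d+n] = -np.eye(n)
    G[n:, -1] = 1.
    h = np.zeros(2*n)
    # solve the qp
    sol = cvx.solvers.qp(cvx.matrix(P), cvx.matrix(q), cvx.matrix(G), cvx.matrix(h))
    return np.array(sol['x'])[:d].ravel(), np.array(sol['x'])[d:d+n].ravel(), np.array(sol['x'])[-1]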
In [ ]:
wstar, xistar, rhostar = calculate_primal_qp_solution(X, 1.)
print('Optimized solution: ', wstar)
print('Truth: ', np.mean(X, axis=1))
print('Difference: ', np.linalg.norm(wstar - np.mean(X, axis=1)))
As you might notice, coding the derivatives is not trivial and takes most of the time. There are some methods built into scipy that help you with that and also optimize using more elaborate techniques such as L-BFGS (a limited-memory quasi-Newton method that uses approximate second-order information). Here, let's recycle some of our functions...
Here is a link to the scipy 'minimize' function which implements lots of solvers for smooth (un-)constrained optimization problems:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
We will recycle our smooth one-class SVM objective function.
Use L-BFGS-B as optimizer.
In [ ]:
# Generate some test data
np.random.seed(42)
X = np.random.randn(10, 100)
fun = partial(fun_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)
# TODO: res = ... (the result object returned by scipy.optimize.minimize)
xstar = res.x
wstar = xstar[1:]
print(wstar)
print(np.mean(X, axis=1))
print(np.linalg.norm(wstar - np.mean(X, axis=1)))
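A possible way to fill the TODO above, assuming we also rebuild the gradient partial and pass it via jac (scipy could alternatively approximate the gradient numerically):
In [ ]:
grad = partial(grad_smooth_ocsvm, X=X, nu=1.0, delta=0., epsilon=0.5)
x0 = 0.0 * np.random.randn(10 + 1)
# call scipy's minimize with the L-BFGS-B solver; jac provides the analytic gradient
res = minimize(fun, x0, jac=grad, method='L-BFGS-B')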
In [ ]: