4. Non-linear Classification

  • Nonlinear transformation on data, then just apply a linear classifier
  • We will use this kernel trick with our SVMs
  • In order to be able to do this, we will have to reformulate one of our primal SVM formulations into a dual formulation

Background: Convexity

Gathered from multiple sources, such as Wikipedia, the ETH DM lecture slides, mmds.org, and Stanford lectures on convex optimization (EE 364A, available on YouTube).

Positive semidefiniteness

  • Only defined for square matrices (obvious, see definition below)
  • $X \in \mathbb{R}^{n \times n}$ positive semidefinite $\iff z^TXz \ge 0 \> \forall z$
  • semi = non-strict inequality
  • Analogous definitions for negative and for non-semi variants
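
A quick numpy sanity check of the definition (my own snippet, not from the sources above). The example matrix is a Gram matrix $A^TA$, which is always PSD since $z^TA^TAz = \|Az\|_2^2 \ge 0$:

```python
import numpy as np

# A Gram matrix A^T A is always positive semidefinite.
A = np.random.randn(4, 3)
X = A.T @ A

# Check the definition on random vectors z (up to floating-point error)...
zs = np.random.randn(1000, 3)
assert all(z @ X @ z >= -1e-12 for z in zs)

# ...and via eigenvalues: a symmetric X is PSD iff all its eigenvalues are >= 0.
print(np.linalg.eigvalsh(X) >= -1e-12)  # all True
```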

Affine and convex sets

  • An affine set always also contains the entire line through any two of its points.
    • Any two points $x_1, x_2$; $x = \theta x_1 + (1 - \theta)x_2, \theta \in \mathbb{R}$
    • example: solution set of linear equations $\{x \> | \> Ax = b \}$.
  • A convex set always contains the line segment between any two points in itself.
    • Any two points $x_1, x_2$; $x = \theta x_1 + (1 - \theta)x_2, \theta \in [0, 1]$
    • example: from geometry, a square-shape is convex, a donut isn't
    • Intuition: affinity is a stricter requirement than convexity (the whole line must be contained, not just the segment)
    • Any affine set is also a convex set.

Convex combination and convex hull

  • A convex combination of points $x_1,\dots,x_k$: any point $x$ of the form:
    • $ x = \theta_1x_1 + \theta_2x_2 + \dots + \theta_kx_k $
    • with $\theta_1 + \dots + \theta_k = 1, \> \theta_i \ge 0$
    • omitting the second constraint $\implies$ linear combination
  • A convex hull of a set $S$ ($\operatorname{conv} S$) is the set of all convex combinations of points in $S$.
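
A tiny illustration (mine): `np.random.dirichlet` draws weights $\theta_i \ge 0$ with $\sum_i \theta_i = 1$, i.e. exactly valid convex-combination weights, so it samples points from the convex hull:

```python
import numpy as np

# Three made-up points in R^2; their convex hull is the triangle they span.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# dirichlet gives theta_i >= 0 with sum(theta) == 1: convex-combination weights.
theta = np.random.dirichlet(np.ones(len(pts)))
x = theta @ pts  # a point inside conv{pts}

# For this triangle, membership means x >= 0 componentwise and x_1 + x_2 <= 1.
assert x.min() >= 0 and x.sum() <= 1 + 1e-12
print(theta, x)
```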

Hyperplanes and halfspaces

  • A hyperplane is a set of the form $\{x \> | \> a^Tx = b\}, (a \ne 0)$
    • $a$ is the normal, $b$ is the intercept (or bias)
    • all hyperplanes are affine (and convex)
  • A halfspace is a set of the form $\{x \> | \> a^Tx \le b\}, (a \ne 0)$
    • same terminology as for hyperplanes
    • halfspaces are not vector spaces (not closed under scalar multiplication; we can easily "escape it")
    • all halfspaces are convex

Euclidean balls and ellipsoids

  • Euclidean = using the $l_2$ (Euclidean) norm in their description
  • A (Euclidean) ball of center $x_c$ and radius $r$ is
    • $B(x_c, r) = \left\{ x \> | \> \| x - x_c \|_2 \le r \right\} = \left\{x_c + ru \> | \> \|u\|_2 \le 1 \right\}$
  • An ellipsoid is a set of form:
    • $\left\{x\>|\> (x - x_c)^{T} P^{-1} (x - x_c) \le 1 \right\}$, with $P \in \mathbf{S}^n_{++}$ ($P$ symmetric, positive definite matrix)
    • generalization of the Euclidean ball
  • Both are convex sets

Norms

  • Any function $\| \cdot \|$ which satisfies
    • positive; $\|x\| = 0 \iff x = 0$
    • $\|tx\| = \lvert t \rvert \|x\| \> \forall t \in \mathbb{R}$
    • $\|x + y \| \le \|x\| + \|y\|$
  • Can define a norm ball with any norm and radius. The Euclidean ball from the previous section was a particular example.
    • in $\mathbb{R}^2$, the unit ball of $\| \cdot \|_1$ is a diamond shape centered around the origin, that of $\| \cdot \|_2$ is a disk, and that of $\| \cdot \|_\infty$ is a square.

Polyhedra

  • Solution sets of finitely many linear equalities and inequalities
    • $Ax \preccurlyeq b$ (component-wise $\le$, since we're shoving a bunch of inequalities into a big matrix-vector multiplication followed by a vector-vector comparison) and $Cx = d$
  • Convex
  • Polyhedron is intersection of finite number of halfspaces and hyperplanes
  • Halfspaces and hyperplanes are special cases of polyhedra
  • Equalities turn our resulting polyhedron into a "slice" in that dimension

Preserving convexity

Practical methods for establishing convexity of a set $C$

  • Apply definition $x_1, x_2 \in C, \> 0 \le \theta \le 1 \implies \theta x_1 + (1 - \theta)x_2 \in C$

  • Show that $C$ is obtained from simple convex sets (hyperplanes, halfspaces, norm balls, etc.) by operations that preserve convexity

    • intersection (any number); the intersection of any number of convex sets is convex
    • affine functions
    • perspective functions
    • linear-fractional functions
  • "Matlab" approach

    • Pick some random points from the space
    • Test points by brute force (or maybe just keep trying with $\theta = 0.5$)
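
A sketch of that brute-force test in Python (my own toy version; note that it can only ever refute convexity, never prove it):

```python
import numpy as np

def probably_convex(contains, sample, trials=10_000):
    """Monte-Carlo convexity check: pick random pairs of points in the set
    and test whether their midpoint (theta = 0.5) is also in the set."""
    for _ in range(trials):
        x1, x2 = sample(), sample()
        if contains(x1) and contains(x2) and not contains(0.5 * x1 + 0.5 * x2):
            return False  # found a counterexample
    return True  # no counterexample found; the set is *probably* convex

# The unit l2 ball is convex; the annulus ("donut") is not.
ball = lambda x: np.linalg.norm(x) <= 1.0
annulus = lambda x: 0.5 <= np.linalg.norm(x) <= 1.0
sample = lambda: np.random.uniform(-1, 1, size=2)
print(probably_convex(ball, sample))     # True
print(probably_convex(annulus, sample))  # False: midpoints can land in the hole
```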

Affine functions

  • Suppose $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is affine ($\> f(x) = Ax + b, A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m$)
  • $f$ turns convex sets into convex sets
  • $f^{-1}$ does the same (the inverse image of a convex set under $f$ is convex)
  • Examples: scaling, translation, projection

Perspective and linear-fractional functions

  • A perspective function $P : \mathbb{R}^{n+1} \rightarrow \mathbb{R}^{n}$:
    • $P(x, t) = x / t$ (divide the first $n$ components by the last one, then drop the last)
    • Perspective functions and their inverses preserve convexity
  • A linear-fractional function
    • A generalization of the perspective function.
    • Maintains convexity

Generalized inequalities and cones

  • A simple ray from the origin is not a proper cone.
  • The nonnegative orthant is a proper cone.
  • Can define generalized inequalities between points w.r.t. a proper cone. However, these only give a partial ordering, not a total one: some pairs of points are incomparable.
  • The minimum vs. minimal distinction exists because of those non-comparable points: a minimum is $\preccurlyeq$ every other point, while a minimal point merely has no point $\prec$ it.
  • A dual cone of a cone $K$:
    • $K^* = \left\{y \> \lvert \> y^Tx \ge 0 \> \forall \> x \in K \right\}$
    • Any vector in the dual cone makes an angle $\le 90$ degrees to any vector in the original cone
    • Some cones are self-dual (e.g. the nonnegative orthant $\mathbb{R}^n_+$; see the sanity check after this list)
    • $(K^*)^* = K$ if $K$ is proper
  • Optimal production frontier problem
    • the optimal frontier vectors are the vectors minimal w.r.t. $\mathbb{R}_{+}^{n}$
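
A numeric sanity check (mine) that $\mathbb{R}^n_+$ is self-dual; including the standard basis vectors $e_i$ among the cone samples makes the check exact rather than merely probabilistic:

```python
import numpy as np

# y is in the dual cone K* iff y^T x >= 0 for all x in K = R^n_+.
n = 3
xs = np.vstack([np.eye(n), np.abs(np.random.randn(2000, n))])  # samples from R^n_+
ys = np.random.randn(1000, n)

in_dual = (ys @ xs.T >= 0).all(axis=1)  # y^T x >= 0 for every sampled x
in_cone = (ys >= 0).all(axis=1)         # y itself lies in R^n_+
print((in_dual == in_cone).all())       # True: the two conditions coincide
```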

Supporting hyperplane theorem

  • A supporting hyperplane to set $C$ at boundary point $x_0$: $\left\{x \> | \> a^T x = a^T x_0 \right\}$, where $a \ne 0$ and $a^Tx \le a^Tx_0 \> \forall x \in C$
  • Goes through $x_0$ and all of the set is on one side of the hyperplane

Background: Convex functions

A function is convex if its domain ($\operatorname{\mathbf{dom}} \> f$) is a convex set and

\begin{equation} f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta)f(y), \> \forall x, y \in \operatorname{\mathbf{dom}}{f}, 0 \le \theta \le 1 \end{equation}

Roughly speaking, upward curvature. Strict convexity is defined analogously (with strict inequality). $f$ is concave if $-f$ is convex.

Examples:

  • affine (linear + constant): $ax + b$ on $\mathbb{R}$ (the definition of convexity actually becomes an equality)
  • exponentials, powers, powers of absolute value
  • negative entropy $x\log{x}$ on $\mathbb{R}_{++}$
  • All norms are convex
  • All affine functions are both convex and concave

Trivial to establish for functions on $\mathbb{R}$ (e.g. via the second derivative).
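
The same Monte-Carlo idea as for convex sets works for functions (my own sketch; again, it can only refute convexity, never prove it):

```python
import numpy as np

def convexity_violated(f, sample, trials=10_000):
    """Search for a counterexample to
    f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y)."""
    for _ in range(trials):
        x, y, t = sample(), sample(), np.random.rand()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-12:
            return True
    return False

sample = lambda: np.random.uniform(0.01, 5.0)  # stay inside R_++
print(convexity_violated(lambda x: x * np.log(x), sample))  # False: convex
print(convexity_violated(np.sqrt, sample))  # True: sqrt is concave, not convex
```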

Matrix function examples ($\mathbb{R}^{m \times n}$):

  • Affine function
\begin{equation} f(X) = \operatorname{tr}(A^TX) + b = \sum\limits_{i=1}^m \sum\limits_{j=1}^n A_{i,j}X_{i,j} + b \end{equation}
(the trace of the product is just the inner product of the two matrices)
  • Spectral norm (maximum singular value, i.e. the square root of the largest eigenvalue of $X^TX$):
\begin{equation} f(X) = \|X\|_2 = \sigma_{max}(X) = \left(\lambda_{max}(X^TX)\right)^{\frac{1}{2}} \> \text{(non-trivial to compute)} \end{equation}
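
Quick numpy check (mine) that the three characterizations of the spectral norm agree:

```python
import numpy as np

X = np.random.randn(5, 3)

via_svd = np.linalg.svd(X, compute_uv=False).max()    # largest singular value
via_eig = np.sqrt(np.linalg.eigvalsh(X.T @ X).max())  # sqrt of largest eigenvalue of X^T X
via_norm = np.linalg.norm(X, ord=2)                   # numpy's built-in matrix 2-norm

print(np.allclose(via_svd, via_eig), np.allclose(via_svd, via_norm))  # True True
```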

TODO: Stanford Convex Optimization I, Lecture 3, 22:10

Linear program (LP)

  • Everything is affine (objective, (in)equality constraints).
  • Feasible set is polyhedron.
  • Objective is affine, so its level sets are hyperplanes in practice.
  • Even a small LP can have an astronomical number of vertices in its feasible set.
  • Examples:
    • diet problem (solution: quantities $x_1,\dots,x_n$ of $n$ foods, each with costs and nutrient compositions; need to keep nutrients above certain thresholds while minimizing the cost of the food; sketched in code below)
    • Chebyshev center of a polyhedron
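
A toy diet problem in scipy (all numbers made up). `linprog` solves $\min c^Tx$ s.t. $A_{ub}x \preccurlyeq b_{ub}$, so the "nutrients above thresholds" constraints have to be negated:

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 3.0, 1.5])   # cost per unit of each of 3 foods
N = np.array([[1.0, 2.0, 0.5],  # nutrient content per unit of food (2 nutrients)
              [3.0, 0.5, 1.0]])
r = np.array([10.0, 8.0])       # required amount of each nutrient

# N x >= r  becomes  -N x <= -r; quantities must be nonnegative.
res = linprog(c, A_ub=-N, b_ub=-r, bounds=[(0, None)] * 3)
print(res.x, res.fun)  # optimal food quantities and the minimal cost
```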

Duality

A way to handle the constraints by incorporating them into the objective: constraint violations add "irritation" to the objective's value, weighted by the multipliers. (See previous lecture.)

Lagrangian

Turns a constrained optimization problem like the following:

\begin{equation} \begin{aligned} \text{minimize} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \dots, m \\ \text{ } & h_i(x) = 0, \quad i = 1, \dots, p \end{aligned} \end{equation}

Into a single objective $\mathcal{L}$ (or, simply $L$, if you don't want to type \mathcal all the time):

\begin{equation} \mathcal{L}(x, \lambda, v) = f_0(x) + \sum_{i=1}^{m}\lambda_i f_i(x) + \sum_{i=1}^{p} v_i h_i(x) \end{equation}

$\lambda_i$ and $v_i$ are the Lagrange multipliers. The vectors $\lambda$ and $v$ are called the dual variables or Lagrange multiplier vectors.

The Lagrange dual function $g(\lambda, v)$ is the infimum of the Lagrangian over $x$ (in the problem domain $\mathcal{D}$) for given $\lambda$ and $v$. Namely:

\begin{equation} g(\lambda, v) = \inf_{x \in \mathcal{D}} \> \mathcal{L}(x, \lambda, v) \end{equation}

This function gives us lower bounds on the optimal value $p^*$ of the original problem: $g(\lambda, v) \le p^*$ for any $\lambda \succcurlyeq 0$ and any $v$ (i.e. our solution can't ever be better than $g$).

The Lagrange dual problem

The optimal solution of the problem is guaranteed to be at least as large as $g$ (lower bound property). So what?

Well, we can now solve the problem "from the other side" by maximizing the dual function, instead of minimizing the original constrained objective function.

\begin{equation}\begin{aligned} \text{maximize} & \quad g(\lambda, v) \\ \text{subject to} & \quad \lambda \succcurlyeq 0 \end{aligned}\end{equation}

A simple example: Least-squares solution of linear equations

SVMs are a bit complicated. Let's look at deriving the Lagrange dual function for a simpler problem first.

Primal statement:

\begin{equation} \begin{aligned} \text{minimize} \quad & x^Tx \\ \text{subject to} \quad & Ax = b \end{aligned} \end{equation}

This is a textbook example of a (convex) optimization problem. $A \in \mathbb{R}^{p \times n}$, which means there are $p$ equality constraints. Forming the Lagrangian according to the definition, we get:

\begin{equation} \mathcal{L}(x, v) = x^Tx + v^T (A x - b), \quad \mathcal{L} : \mathbb{R}^{n} \times \mathbb{R}^{p} \rightarrow \mathbb{R} \end{equation}

This yields the dual function $g(v) = \inf_x \mathcal{L}(x, v)$. Its value is simple to figure out, since the Lagrangian is a convex quadratic function in $x$ (like $x^2 + ax + c$, but with vectors and different constants), so we can just set its gradient w.r.t. $x$ to zero:

\begin{equation} \nabla_x \mathcal{L}(x, v) = 2 x + A^Tv = 0 \implies x = -\frac{1}{2} A^Tv \end{equation}

We can now formulate our dual function based on just $A$, $b$, and $v$:

\begin{equation} \begin{split} g(v) & = \inf_x \mathcal{L}(x, v) \\ & = \mathcal{L}(-\frac{1}{2}A^T v, v) \\ & = \left(-\frac{1}{2} A^T v \right)^T \left(-\frac{1}{2} A^T v \right) + v^T \left( -\frac{1}{2} A A^T v - b \right) \\ & = -\frac{1}{4} v^T A A^T v - b^T v \end{split} \end{equation}
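
We can verify numerically that maximizing $g$ recovers the primal optimum (strong duality holds for this problem): setting $\nabla g(v) = -\frac{1}{2}AA^Tv - b = 0$ gives $v^* = -2(AA^T)^{-1}b$, and $x^* = -\frac{1}{2}A^Tv^*$ is exactly the minimum-norm solution of $Ax = b$. A quick numpy check (mine):

```python
import numpy as np

p, n = 2, 4  # p equality constraints in n variables
A = np.random.randn(p, n)
b = np.random.randn(p)

# Primal optimum: the minimum-norm solution of Ax = b.
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# Dual optimum: maximize g(v) = -1/4 v^T A A^T v - b^T v (set gradient to 0).
v_star = -2 * np.linalg.solve(A @ A.T, b)
d_star = -0.25 * v_star @ A @ A.T @ v_star - b @ v_star

print(np.isclose(p_star, d_star))                # True: no duality gap
print(np.allclose(x_star, -0.5 * A.T @ v_star))  # True: primal recovered from dual
```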

Duality applied to SVMs

We can form the Lagrangian of the primal SVM formulation in the same way, in order to automagically include its constraints (the slack variable inequalities) in the objective function. This yields the dual problem:

\begin{equation} \max_\alpha \sum_{i=1}^{n} \alpha_{i} - \frac{1}{2}\sum_{i,j}\alpha_{i} \alpha_{j} y_i y_j x_i^{T} x_j \quad \text{s.t.} \> 0 \le \alpha_{i} \le C \end{equation}

This is equivalent to the optimal model computed by minimizing the primal formulation; the primal weights are recovered from the optimal dual variables as $w^{*} = \sum_{i=1}^{n} \alpha_{i}^{*} y_i x_i$.

The data points with $\alpha_i > 0$ are the support vectors

  • This means that our model only depends on the points which lie on the margin or violate it; for every other point, $\alpha_i = 0$.
  • Question: what about points which are very "bad", e.g. in the middle of the opposite class? They are support vectors too: margin violators end up with $\alpha_i = C$, which is how the slack variables show up in the dual.

How do we predict with this formulation?

\begin{equation} y = \operatorname{sign} \left( \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) \right) \end{equation}

This implies many things!

  • First off, no more "instant" classification by just taking the sign of a single inner product, as was the case with the primal formulation.
  • We have a sum, so we need to do $\mathcal{O}(n)$ computations for every new point we want to classify.
  • Stop right there criminal scum!
  • Luckily, most $\alpha_i$s will be 0, so we only sum kernel evaluations between the new point and the support vectors, and there are typically far fewer support vectors than total data points. (If there weren't, the data would be very hard to separate, and an SVM wouldn't be a feasible approach anyway.)
  • This may still prove to be too slow $\implies$ use the "inverse kernel trick" (see below). The dual approach also requires training the SVM in the conventional batch way, so it doesn't work with (P)SGD, which is necessary when working with very large datasets; the inverse kernel trick sidesteps this as well.
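
A minimal sketch of dual prediction (the support vectors, alphas, and labels are made up, as if read off an already-trained SVM; the bias term is omitted, matching the formulation above):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Unnormalized Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def predict(x_new, support_vectors, alphas, labels, kernel=rbf_kernel):
    """sign(sum_i alpha_i y_i k(x_i, x_new)) over the support vectors only."""
    s = sum(a * y * kernel(sv, x_new)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s)

# Hypothetical support set; alphas are all > 0 by definition.
svs = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
alphas = np.array([0.7, 0.7, 0.4])
labels = np.array([1, 1, -1])
print(predict(np.array([0.5, 0.5]), svs, alphas, labels))  # 1.0
```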

Kernel functions

  • Kernels are cool. So what can we use as a kernel?
  • A kernel is a function $k : X \times X \rightarrow \mathbb{R}$ which satisfies
    • symmetry: $\forall x, x' \in X \> k(x, x') = k(x', x)$ (otherwise our kernelized SVM would be sensitive to the order of our training data points; we're not doing OCP; this would be bad!)
    • positive semi-definiteness of the kernel (Gram) matrix
  • Examples include:
    • linear
    • polynomial
    • Gaussian (unnormalized)
    • data-type-specific kernels
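
A quick check of both requirements on the example kernels (my own snippet; parameter values are arbitrary):

```python
import numpy as np

linear = lambda x, y: x @ y
poly = lambda x, y, deg=3: (1 + x @ y) ** deg
gauss = lambda x, y, s=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * s**2))

X = np.random.randn(20, 4)  # made-up data

for k in (linear, poly, gauss):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    assert np.allclose(K, K.T)                   # symmetry
    assert np.linalg.eigvalsh(K).min() >= -1e-9  # PSD up to numerical error
print("all Gram matrices are symmetric and PSD")
```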

The "Inverse Kernel Trick"

  • Explicitly generate low-dim nonlinear features, then apply ordinary linear method.
  • $x \in \mathbb{R}^d \rightarrow \phi(x) \in \mathbb{R}^{D} \rightarrow z(x) \in \mathbb{R}^m$
  • with $D \gg d$ and $m \ll D$
  • Note: this happens via feature generation. This is the key! We're operating at a "row level" in the data, and still using an ordinary (fast) linear method at the end (which doesn't depend on e.g. the number of support vectors).
  • A shift-invariant kernel means $k(x, y) = k(x - y)$ (i.e. $x$ and $y$ only appear as a subtraction; $k$ is a function only of the difference between the arguments)
    • these kernels have a Fourier transform, so we can generate random Fourier features and enrich our data points (sketched after this list)
  • A radial basis kernel is a stronger condition than shift invariance: it depends only on the magnitude of the difference between the two arguments, i.e. $k(x, x') = k(\|x - x'\|)$.
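
A sketch of the random Fourier features construction for the (unnormalized) Gaussian kernel, following Rahimi & Recht; $m$, $\sigma$ and the data below are made up:

```python
import numpy as np

def random_fourier_features(X, m=500, sigma=1.0, seed=0):
    """Map rows of X to z(x) in R^m such that
    z(x)^T z(y) ~ exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], m))  # kernel's spectral samples
    b = rng.uniform(0, 2 * np.pi, size=m)                    # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

X = np.random.randn(5, 3)
Z = random_fourier_features(X, m=5000)
approx = Z @ Z.T  # plain inner products in the new feature space
exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())  # small, and shrinks as m grows
```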

Kernels and online learning

TODO

Miscellaneous

  • Pareto optimal = not stupid (Prof. Stephen Boyd, Stanford EE dept.)