- Nonlinear transformation on data, then just apply a linear classifier
- We will use this
*kernel trick*with our SVMs - In order to be able to do this, we will have to reformulate one of our primal SVM formulations into a dual formulation

- An
**affine set**always also contains the line through any two points in itself.- Any two points $x_1, x_2$; $x = \theta x_1 + (1 - \theta)x_2, \theta \in \mathbb{R}$
- example: solution set of linear equations $\{x \> | \> Ax = b \}$.

- A
**convex set**always contains the line segment between any two points in itself.- Any two points $x_1, x_2$; $x = \theta x_1 + (1 - \theta)x_2, \theta \in [0, 1]$
- example: from geometry, a square-shape is convex, a donut isn't
- Intuition: an affine set is a more general form of a convex set
*Any affine set is also a convex set.*

- A
**convex combination**of points $x_1,\dots,x_k$: any point $x$ of the form:- $ x = \theta_1x_1 + \theta_2x_2 + \dots + \theta_kx_k $
- with $\theta_1 + \dots + \theta_k = 1, \> \theta_i \ge 0$
- omitting the second constraint $\implies$ linear combination

- A
**convex hull of a set S**($\operatorname{conv} S$) is the set of all convex combination points in S.

- A
**hyperplane**is a set of the form $\{x \> | \> a^Tx = b\}, (a \ne 0)$- $a$ is the normal, $b$ is the intercept (or bias)
- all hyperplanes are affine (and convex)

- A
**halfspace**is a set of the form $\{x \> | \> a^x < b\}, (a \ne 0)$- same terminology as for hyperplanes
- halfspaces are not vector spaces (not closed under scalar multiplication; we can easily "escape it")
- all halfspaces are convex

- Euclidean = using the $l_2$ (Euclidean) norm in their description
- An
**(Euclidean) ball**of center $x_c$ and radius $r$ is- $B(x_c, r) = \left\{ x \> | \> \| x - x_c \|_2 \le r \right\} = \left\{x_c + ru \> | \> \|u\|_2 \le 1 \right\}$

- An
**ellipsoid**is a set of form:- $\left\{x\>|\> (x - x_c)^{T} P^{-1} (x - x_c) \le 1 \right\}$, with $P \in \mathbf{S}^n_{++}$ ($P$ symmetric, positive definite matrix)
- generalization of the Euclidean ball

- Both are convex sets

- Any function $\| \cdot \|$ which satisfies
- positive; $\|x\| = 0 \iff x = 0$
- $\|tx\| = \lvert t \rvert \|x\| \> \forall t \in \mathbb{R}$
- $\|x + y \| \le \|x\| + \|y\|$

- Can define a
*norm ball*with any norm and radius. The Euclidean ball from the previous section was a particular example.- in $\mathbb{R}^2$, $\| \cdot \|_2$ is a diamond shape centered around the origin; $\| \cdot \|_\infty$ is a square.

- Solution sets of finitely many linear equalities and inequalities
- $Ax \preccurlyeq b$ (component-wise less than, since we're shoving a bunch of inequalities in a big matrix-vector multiplication followed by a vector-vector comparison) and $Cx = d$

- Convex
- Polyhedron is intersection of finite number of halfspaces and hyperplanes
- Halfspaces and hyperplanes are special cases of polyhedra
- Equalities turn our resulting polyhedron into a "slice" in that dimension

Practical methods for establishing convexity of a set $C$

Apply definition $x_1, x_2 \in C, \> 0 \le \theta \le 1 \implies \theta x_1 + (1 - \theta)x_2 \in C$

Show that $C$ is obtained from simple convex sets (hyperplanes, halfspaces, norm balls, etc.) by operations that preserve convexity

- intersection (any number); the intersection of any number of convex sets is convex
- affine functions
- perspective functions
- linear-fractional functions

"Matlab" approach

- Pick some random points from the space
- Test points by brute force (or maybe just keep trying with $\theta = 0.5$)

- Suppose $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is
*affine*($\> f(x) = Ax + b, A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m$) - $f$ turns convex sets into convex sets
- $f^{-1}$ does the same
- Examples: scaling, translation, projection

- A
**perspective function**$P : \mathbb{R}^{n+1} \rightarrow \mathbb{R}^{n}$:- $P(x, t) = x / t$ (divide first n - 1 by last element and discard it)
- Perspective functions and their inverses conserve convexity

- A
**linear-fractional function**- A generalization of the perspective function.
- Maintains convexity

- A simple ray from the origin is not a proper cone.
- The nonnegative orthant is a proper cone.
- Can define point inequalities in proper cones. However, they don't define a general ordering.
- Minimum vs. minimal has to deal with other, non-comparable points (which exist when we're not dealing with a general ordering)
- A dual cone of a cone $K$:
- $K^* = \left\{y \> \lvert \> y^Tx \ge 0 \> \forall \> x \in K \right\}$
- Any vector in the dual cone makes an angle $\le 90$ degrees to any vector in the original cone
- Some cones are self-dual (e.g. the positive orthant)
- $(K^*)^* = K$ if $K$ is proper

- Optimal production frontier problem
- the optimal frontier vectors are vecotrs minimal w.r.t. $\mathbb{R}_{+}^{n}$

- A
**supporting hyperplane**to set $C$ at boundary point $x_0$: $\left\{x \> | \> a^T x = a^T x_0 \right\}$, where $a \ne 0$ and $a^Tx \le a^Tx_0 \> \forall x \in C$ - Goes through $x_0$ and all of the set is on one side of the hyperplane

A function is **convex** if its domain ($\operatorname{\mathbf{dom}} \> f$) is a convex set and

Roughly speaking, upward curvature. Strict convexity defined trivially. $f$ is concave is $-f$ is convex.

Examples:

- affine (linear + constant): $ax + b$ on $\mathbb{R}$ (the definition of convexity actually becomes an equality)
- exponentials, powers, powers of absolute value
- negative entropy $x\log{x}$ on $\mathbb{R}_{++}$
- All norms are convex
- All affine functions are both convex and concave

Trivial to establish for functions on $\mathbb{R}$.

Matrix function examples ($\mathbb{R}^{m \times n}$):

- Affine function

- Spectral(maximum singular value) norm (sqrt of max eigenvalue):

TODO: Stanford Convex Optimization I, Lecture 3, 22:10

- Everything is affine (objective, (in)equality constraints).
- Feasible set is
*polyhedron*. - Objective is affine, so a hyperplane in practice.
- Small LP $\implies$ astronomical number of vertices in feasible set.
- Examples:
- diet problem (solution: quantities $x_1,\dots,x_n$ of $n$ foods, each with costs and nutrient compositions; need to keep nutrients above certain thresholds while minimizing the cost of the food)
- Cebyshev center of polyhedron

A way to handle the constraints by incorporating them into the objective. Irritation vs. value for a function. (See previous lecture)

Turns a constrained optimization problem like the following:

\begin{equation} \begin{aligned} \text{minimize} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \dots, m \\ \text{ } & h_i(x) = 0, \quad i = 1, \dots, p \end{aligned} \end{equation}Into a single objective $\mathcal{L}$ (or, simply $L$, if you don't want to type `\mathcal`

all the time):

$\lambda_i$ and $v_i$ are the **Lagrange multipliers**.
The vectors $\lambda$ and $v$ are called the *dual variables* or *Lagrange multiplier vectors*.

The **Lagrange dual function** $g(\lambda, v)$ is the minimum value of the Lagrangian over x (in the original data domain $\mathcal{D}$) for any $\lambda$ and $v$. Namely:

This function gives us lower bounds on the optimal value $p^*$ of the original problem (i.e. our solution can't ever be better than $g$).

The optimal solution for the problem is guaranteed to be larger than $g$ (lower bound property). *So what?*

Well, we can now solve the problem "from the other side" by maximizing the dual function, instead of minimizing the original constrained objective function.

\begin{equation}\begin{aligned} \text{maximize} & \quad g(\lambda, v) \\ \text{subject to} & \quad \lambda \succcurlyeq 0 \end{aligned}\end{equation}SVMs are a bit complicated. Let's look at deriving the *Lagrange dual function* for a simpler problem first.

Primal statement:

\begin{equation} \begin{aligned} \text{minimize} \quad & x^Tx \\ \text{subject to} \quad & Ax = b \end{aligned} \end{equation}This is a textbook example of a (convex) optimization problem.
$A \in R^{p \times n}$, which means there are $p$ equality constraints.
Apply Lagrange operator and we get the following *Lagrangian*, according to the definition:

This yields the *dual function* $g(v) = \inf_x \mathcal{L}(x, v)$. Its actual value is simple to figure out, since the Lagrangian is a convex quadratic function (like $x^2 + ax + c$, but with vectors and different constants).

We can now formulate our dual function based on just $A$, $b$, and $v$:

\begin{equation} \begin{split} g(v) & = \inf_x \mathcal{L}(x, v) \\ & = \mathcal{L}(-\frac{1}{2}A^T v, v) \\ & = \left(-\frac{1}{2} A^T v \right)^T \left(-\frac{1}{2} A^T v \right) + v^T \left( -\frac{1}{2} A A^T v - b \right) \\ & = -\frac{1}{4} v^T A A^T v - b^T v \end{split} \end{equation}We can apply the aforementioned Lagrange operator to the primal SVM formulation in order to automagically include its constraints (the slack variable inequalities) in the objective function:

\begin{equation} \max_\alpha \sum_{i=1}^{n} - \frac{1}{2}\sum_{i,j}\alpha_{i} \alpha_{j} y_i y_j x_i^{T} x_j \quad \text{s. t.} \> 0 \le \alpha_{i} \le C \end{equation}This is equivalent to an optimal model computed by minimizing the primal formulation expressed by $w^{*} = \sum_{i=1}^{n} \alpha_{i}^{*} y_i x_i$

The data points with $\alpha_i > 0$ are the **support vectors**

- This means that our model only depends on the points $i$ which "misbehave", since otherwise the alpha would be 0. (dubious, do more research!)
- Question: including points which are very "bad" and e.g. in the middle of the opposite class? Or is this completely different from how we see slack variables?

How do we predict with this formulation?

\begin{equation} y = \operatorname{sign} \left( \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) \right) \end{equation}This implies many things!

- First off, no more "instant" classification as a result of taking the sign of an inner product (as was the case in the primal case).
- We have a sum, so we need to do $\mathcal{O}(n)$ computations for every new point we want to classify.
- Stop right there criminal scum!
- Most $\alpha_i$s will be 0. So we only compute the sum from applying the kernel function to the new point and every support vector. And there are
*way fewer*support vectors than total data points. Because otherwise it would mean that our data is very difficult to separate anyway, so a SVM wouldn't be a feasible approach anyway. - This may still prove to be to slow $\implies$ use "inverse kernel trick" (see below). This also doesn't work unless we train an SVM in a conventional way. So it doesn't work with (P)SGD, which is necessary when working with very large datasets.

- Kernels are cool. So what can we use as a kernel?
- A
**kernel**is a function $k : X \times X \rightarrow \mathbb{R}$ which satisfies*symmetry*: $\forall x, x' \in X \> k(x, x') = k(x', x)$ (otherwise our kernelized SVM would be sensitive to the order of our training data points; we're not doing OCP; this would be bad!)*positive semi-definiteness*of the kernel (Gram) matrix

- Examples include:
- linear
- polynomial
- gaussian (unnormalized)
- data-type-specific kernels

- Explicitly generate low-dim nonlinear features, then apply ordinary linear method.
- $x \in \mathbb{R}^d \rightarrow \phi(x) \in \mathbb{R}^{\text{D}} \rightarrow z(x) \in \mathbb{R}^m$
- with $D \gg d$ and $m \ll D$
- Note: this happens via feature generation. This is the key! We're operating at a "row level" in the data, and still using an ordinary (fast) linear method at the end (which doesn't depend on e.g. the number of support vectors).
- A
**shift invariant kernel**means $k(x, y) = k(x - y)$ (i.e. x and y only appear as a subtraction; $k$ is a function only of the difference between the arguments)- these kernels have a fourier transform, so we can generate random Fourier features and enrich our data points

- A
**radial basis kernel**is a*stronger shift invariant kernel*, which only depends on the magnitude of the difference between the two arguments, i.e. $k(x, x') = k(\|x, x'\|)$.