Practical methods for establishing convexity of a set $C$
Apply definition $x_1, x_2 \in C, \> 0 \le \theta \le 1 \implies \theta x_1 + (1 - \theta)x_2 \in C$
Show that $C$ is obtained from simple convex sets (hyperplanes, halfspaces, norm balls, etc.) by operations that preserve convexity
"Matlab" approach: sample random points and check the definition numerically; this can refute convexity (by finding a violated chord) but never prove it
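The "Matlab"-style check can be sketched as a Monte-Carlo refutation test (a sketch, not from the lecture; the example set is an assumption chosen because its convexity is known):

```python
import numpy as np

rng = np.random.default_rng(0)

def in_set(x):
    # Example set: unit ball intersected with a halfspace. Both pieces are
    # convex, and intersection preserves convexity, so this set is convex.
    return np.linalg.norm(x) <= 1.0 and x[0] + x[1] <= 0.5

def looks_convex(member, dim=2, trials=10_000):
    """Monte-Carlo check: for random x1, x2 in C and random theta in [0, 1],
    theta*x1 + (1-theta)*x2 must also lie in C. Returns False on any
    violation; True only means "no counterexample found"."""
    for _ in range(trials):
        x1 = rng.uniform(-1, 1, dim)
        x2 = rng.uniform(-1, 1, dim)
        if not (member(x1) and member(x2)):
            continue
        theta = rng.uniform()
        if not member(theta * x1 + (1 - theta) * x2):
            return False
    return True
```

A non-convex set such as an annulus (`0.5 <= ||x|| <= 1`) fails this test quickly, since chords between near-antipodal points cross the hole.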
A function is convex if its domain ($\operatorname{\mathbf{dom}} \> f$) is a convex set and
\begin{equation} f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta)f(y), \> \forall x, y \in \operatorname{\mathbf{dom}}{f}, 0 \le \theta \le 1 \end{equation}Roughly speaking, upward curvature. Strict convexity is defined analogously, with strict inequality for $x \ne y$ and $0 < \theta < 1$. $f$ is concave if $-f$ is convex.
Examples:
Trivial to establish for functions on $\mathbb{R}$.
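For functions on $\mathbb{R}$, the defining inequality can also be spot-checked numerically; a refutation-only sketch (the interval and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def looks_convex_fn(f, lo=-3.0, hi=3.0, trials=10_000, tol=1e-9):
    """Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random samples.
    Returns False on any violation (beyond a small float tolerance);
    True only means "no counterexample found"."""
    for _ in range(trials):
        x, y = rng.uniform(lo, hi, 2)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False
    return True
```

E.g. `looks_convex_fn(np.exp)` and `looks_convex_fn(lambda x: x**2)` find no violations, while `looks_convex_fn(np.sin)` fails almost immediately on the concave part of the sine.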
Matrix function examples ($\mathbb{R}^{m \times n}$):
TODO: Stanford Convex Optimization I, Lecture 3, 22:10
The Lagrangian: a way to handle the constraints by incorporating them into the objective, trading off "irritation" (constraint violation) against the objective's value. (See previous lecture.)
Turns a constrained optimization problem like the following:
\begin{equation} \begin{aligned} \text{minimize} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \dots, m \\ \text{ } & h_i(x) = 0, \quad i = 1, \dots, p \end{aligned} \end{equation}into a single objective $\mathcal{L}$ (or simply $L$, if you don't want to type \mathcal all the time):
\begin{equation} \mathcal{L}(x, \lambda, v) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} v_i h_i(x) \end{equation}
$\lambda_i$ and $v_i$ are the Lagrange multipliers. The vectors $\lambda$ and $v$ are called the dual variables or Lagrange multiplier vectors.
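As a sketch, the Lagrangian $\mathcal{L}(x, \lambda, v) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_i v_i h_i(x)$ can be assembled generically from plain Python callables (the helper name and example constraint below are illustrative, not from the lecture):

```python
import numpy as np

def lagrangian(f0, fs, hs):
    """Build L(x, lam, v) = f0(x) + sum_i lam_i*f_i(x) + sum_i v_i*h_i(x)
    from an objective f0, inequality constraints fs, equality constraints hs."""
    def L(x, lam, v):
        ineq_terms = sum(li * fi(x) for li, fi in zip(lam, fs))
        eq_terms = sum(vi * hi(x) for vi, hi in zip(v, hs))
        return f0(x) + ineq_terms + eq_terms
    return L

# Toy instance: minimize x^T x subject to x_0 + x_1 = 1
# (one equality constraint, no inequality constraints).
f0 = lambda x: x @ x
hs = [lambda x: x[0] + x[1] - 1.0]
L = lagrangian(f0, [], hs)
```

At `x = (1, 1)` with multiplier `v = (2,)` this gives `L = 2 + 2*(1) = 4`.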
The Lagrange dual function $g(\lambda, v)$ is the minimum value of the Lagrangian over x (in the original data domain $\mathcal{D}$) for any $\lambda$ and $v$. Namely:
\begin{equation} g(\lambda, v) = \inf_{x \in \mathcal{D}} \> \mathcal{L}(x, \lambda, v) \end{equation}This function gives us lower bounds on the optimal value $p^*$ of the original problem: for any $\lambda \succcurlyeq 0$ and any $v$, $g(\lambda, v) \le p^*$.
Why? For any feasible $x$ we have $\lambda_i f_i(x) \le 0$ and $h_i(x) = 0$, so $\mathcal{L}(x, \lambda, v) \le f_0(x)$; taking the infimum over $x$ gives $g(\lambda, v) \le p^*$ (the lower bound property). So what?
Well, we can now solve the problem "from the other side" by maximizing the dual function, instead of minimizing the original constrained objective function.
\begin{equation}\begin{aligned} \text{maximize} & \quad g(\lambda, v) \\ \text{subject to} & \quad \lambda \succcurlyeq 0 \end{aligned}\end{equation}SVMs are a bit complicated. Let's look at deriving the Lagrange dual function for a simpler problem first.
Primal statement:
\begin{equation} \begin{aligned} \text{minimize} \quad & x^Tx \\ \text{subject to} \quad & Ax = b \end{aligned} \end{equation}This is a textbook example of a (convex) optimization problem. $A \in \mathbb{R}^{p \times n}$, which means there are $p$ equality constraints. Forming the Lagrangian according to the definition gives:
\begin{equation} \mathcal{L}(x, v) = x^Tx + v^T (A x - b), \> \mathcal{L} : \mathbb{R}^{n} \times \mathbb{R}^{p} \rightarrow \mathbb{R} \end{equation}This yields the dual function $g(v) = \inf_x \mathcal{L}(x, v)$. Its actual value is simple to figure out, since the Lagrangian is a convex quadratic function of $x$ (like $x^2 + ax + c$, but with vectors and different constants).
\begin{equation} \nabla_x \mathcal{L}(x, v) = 2 x + A^Tv = 0 \implies x = -\frac{1}{2} A^Tv \end{equation}We can now formulate our dual function based on just $A$, $b$, and $v$:
\begin{equation} \begin{split} g(v) & = \inf_x \mathcal{L}(x, v) \\ & = \mathcal{L}(-\frac{1}{2}A^T v, v) \\ & = \left(-\frac{1}{2} A^T v \right)^T \left(-\frac{1}{2} A^T v \right) + v^T \left( -\frac{1}{2} A A^T v - b \right) \\ & = -\frac{1}{4} v^T A A^T v - b^T v \end{split} \end{equation}We can apply the same Lagrangian machinery to the primal SVM formulation in order to automagically include its constraints (the slack variable inequalities) in the objective function:
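Before moving on to the SVM: the least-norm dual derived above can be sanity-checked numerically. A sketch on a small random instance, using the standard closed-form least-norm solution $x^\star = A^T(AA^T)^{-1}b$ (which assumes $A$ has full row rank):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 6
A = rng.standard_normal((p, n))  # full row rank with probability 1
b = rng.standard_normal(p)

# Primal optimum of `minimize x^T x  s.t.  Ax = b`:
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

def g(v):
    # Dual function derived above: g(v) = -1/4 v^T A A^T v - b^T v
    return -0.25 * v @ (A @ A.T) @ v - b @ v

# Weak duality: g(v) <= p* for every v.
gap_ok = all(g(rng.standard_normal(p)) <= p_star + 1e-9 for _ in range(100))

# Maximizing g: grad g(v) = -1/2 A A^T v - b = 0  =>  v* = -2 (A A^T)^{-1} b,
# and for this problem g(v*) actually equals p* (strong duality).
v_star = -2 * np.linalg.solve(A @ A.T, b)
print(gap_ok, np.isclose(g(v_star), p_star))  # True True
```

Plugging $v^\star$ back in gives $g(v^\star) = b^T(AA^T)^{-1}b = p^\star$, so the bound is tight here.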
\begin{equation} \max_\alpha \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_{i} \alpha_{j} y_i y_j x_i^{T} x_j \quad \text{s.t.} \> 0 \le \alpha_{i} \le C \end{equation}Solving this dual yields the same optimal model as minimizing the primal formulation; the optimal weights are recovered as $w^{*} = \sum_{i=1}^{n} \alpha_{i}^{*} y_i x_i$
The data points with $\alpha_i > 0$ are the support vectors
How do we predict with this formulation?
\begin{equation} y = \operatorname{sign} \left( \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) \right) \end{equation}This implies many things: prediction depends on the data only through inner products (which is exactly what lets us swap in a kernel $k$), and only the support vectors ($\alpha_i > 0$) contribute to the sum.
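A toy end-to-end sketch (not the lecture's code, and the data is made up): the box-constrained dual above has no equality constraint, which corresponds to an SVM without a bias term, so it can be maximized by simple projected gradient ascent, with clipping enforcing $0 \le \alpha_i \le C$. Prediction then uses the kernel expansion:

```python
import numpy as np

# Tiny linearly separable toy set (hypothetical data).
X = np.array([[2.0, 2.0], [1.0, 2.0], [-2.0, -2.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

def k(a, b):
    return a @ b  # linear kernel; swap in e.g. an RBF for nonlinear boundaries

K = np.array([[k(xi, xj) for xj in X] for xi in X])
Q = np.outer(y, y) * K  # Q_ij = y_i y_j k(x_i, x_j)

# Maximize  sum_i alpha_i - 1/2 alpha^T Q alpha  s.t.  0 <= alpha_i <= C
# by projected gradient ascent; np.clip projects back onto the box.
alpha = np.zeros(len(X))
for _ in range(2000):
    grad = 1.0 - Q @ alpha
    alpha = np.clip(alpha + 0.01 * grad, 0.0, C)

def predict(x):
    # y = sign(sum_i alpha_i y_i k(x_i, x))
    return np.sign(sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y, X)))
```

On this set the training points are classified correctly, and only a subset of them ends up with $\alpha_i > 0$ (the support vectors). The fixed step size 0.01 is an assumption that happens to be small enough for this data; a real solver (e.g. SMO) handles step sizes and the bias-term equality constraint properly.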