Chapter 09
Support Vector Machines

1 Maximal Margin Classifier

1.1 What Is a Hyperplane?

In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$.

Mathematically, a hyperplane in $p$-dimensional space is the set of points $X=(X_1,X_2,\ldots,X_p)^T$ satisfying the equation $$ \beta_0+\beta_1X_1+\beta_2X_2+\ldots+\beta_pX_p=0 $$
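As a quick illustration (a minimal sketch with made-up coefficients, not from the text), a point can be tested against the hyperplane equation by evaluating the left-hand side: a value of zero means the point lies on the hyperplane, while the sign of a nonzero value tells which side it falls on.

```python
import numpy as np

# Hypothetical hyperplane in p = 2 dimensions: 1 + 2*X1 + 3*X2 = 0
beta0, beta = 1.0, np.array([2.0, 3.0])

def hyperplane_value(x):
    """Evaluate beta0 + beta1*X1 + ... + betap*Xp at the point x."""
    return beta0 + beta @ x

print(hyperplane_value(np.array([1.0, -1.0])))   # 0.0  -> on the hyperplane
print(hyperplane_value(np.array([2.0, 0.0])))    # 5.0  -> one side
print(hyperplane_value(np.array([-2.0, 0.0])))   # -3.0 -> the other side
```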

1.2 Classification Using a Separating Hyperplane

Suppose that we have an $n \times p$ data matrix $X$ that consists of $n$ training observations in $p$-dimensional space, and that these observations fall into two classes, that is, $y_1,\ldots,y_n \in \{-1,1\}$. Then a separating hyperplane has the property that $$ y_i(\beta_0+\beta_1x_{i1} +\beta_2x_{i2}+\ldots+\beta_px_{ip}) > 0 $$ for all $i=1,\ldots,n$.
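The separating property can be checked directly: with the labels coded as $\pm 1$, every training observation should give a strictly positive value of $y_i(\beta_0+\beta^\top x_i)$. A minimal sketch with toy data (both the coefficients and the observations below are made up):

```python
import numpy as np

# Toy data: n = 4 observations in p = 2 dimensions, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Candidate hyperplane: 0 + 1*X1 + 1*X2 = 0
beta0, beta = 0.0, np.array([1.0, 1.0])

# y_i * f(x_i) > 0 for every i  <=>  the hyperplane separates the two classes
margins = y * (beta0 + X @ beta)
print(margins)                 # [4. 4. 3. 3.]
print(np.all(margins > 0))     # True -> separating hyperplane
```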

1.3 The Maximal Margin Classifier

A natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations. Briefly, the maximal margin hyperplane is the solution to the optimization problem $$ \underset{\beta_0,\beta_1,\ldots,\beta_p}{\text{maximize}}\;M $$ subject to $$ \sum_{j=1}^{p}\beta_j^2=1, $$ $$ y_i(\beta_0+\beta_1x_{i1} +\beta_2x_{i2}+\ldots+\beta_px_{ip}) \ge M \quad \text{for all } i=1,\ldots,n. $$ The constraint $\sum_{j=1}^{p}\beta_j^2=1$ ensures that the left-hand side gives the distance from the $i$th observation to the hyperplane, so $M$ is the width of the margin.
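In practice the maximal margin hyperplane is rarely computed by hand. A rough sketch using scikit-learn (a tool assumed here, not part of the text): when the data are separable, a linear SVC with a very large cost parameter leaves essentially no room for violations and so behaves like a hard-margin (maximal margin) classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A huge C allows (almost) no margin violations, approximating the
# maximal margin classifier on separable data.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

print(clf.coef_, clf.intercept_)   # estimated (beta1, ..., betap) and beta0
print(clf.support_vectors_)        # the observations that define the margin
```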

2 Support Vector Classifier

The support vector classifier, sometimes called a soft margin classifier, allows some observations to fall on the wrong side of the margin, or even on the wrong side of the hyperplane, in exchange for greater robustness and better classification of most of the remaining observations.

It is the solution to the optimization problem $$ \underset{\beta_0,\beta_1,\ldots,\beta_p,\epsilon_1,\ldots,\epsilon_n}{\text{maximize}}\;M $$ subject to $$\sum_{j=1}^{p}\beta_j^2=1,$$ $$y_i(\beta_0+\beta_1x_{i1} +\beta_2x_{i2}+\ldots+\beta_px_{ip}) \ge M(1-\epsilon_i),$$ $$\epsilon_i \ge 0, \quad \sum_{i=1}^n\epsilon_i \le C,$$ where $\epsilon_1,\ldots,\epsilon_n$ are slack variables that allow individual observations to be on the wrong side of the margin or of the hyperplane, and $C$ is a nonnegative tuning parameter that acts as a budget for such violations.
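A sketch of fitting a support vector classifier with scikit-learn (assumed here; the library is not part of the text). Note that scikit-learn's `C` penalizes margin violations, so it plays roughly the inverse role of the budget $C$ in the formulation above: a small scikit-learn `C` corresponds to a large budget and a wider, more tolerant margin.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes, so no perfectly separating hyperplane exists
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for c in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=c).fit(X, y)
    # Larger scikit-learn C -> narrower margin -> typically fewer support vectors
    print(f"C={c:>6}: {len(clf.support_)} support vectors")
```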

3 Support Vector Machines

It is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. The solution to the support vector classifier problem involves only the inner products of the observations. The inner product of two $r$-vectors $a$ and $b$ is defined as $\langle a,b\rangle=\sum_{i=1}^r a_ib_i$. Thus the inner product of two observations $x_i, x_{i'}$ is given by $$ \langle x_i,x_{i'}\rangle=\sum_{j=1}^{p}x_{ij}x_{i'j}. $$ The support vector machine replaces every such inner product with a generalization $K(x_i, x_{i'})$, where $K$ is a kernel function (for example, a polynomial or radial kernel).
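A brief sketch (library usage assumed, not from the text) of how a kernel stands in for the plain inner product, here with a degree-$d$ polynomial kernel computed by hand and a radial-kernel SVM fit in scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# The plain inner product <x_i, x_i'> versus a degree-d polynomial kernel
xi, xip = np.array([1.0, 2.0]), np.array([0.5, -1.0])
inner = xi @ xip                    # sum_j x_ij * x_i'j
poly_kernel = (1 + xi @ xip) ** 3   # K(x_i, x_i') = (1 + <x_i, x_i'>)^d with d = 3

# Data that no linear boundary separates well, handled by a radial (RBF) kernel
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(inner, poly_kernel, svm.score(X, y))
```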

4 SVMs with more than two classes

4.1 One-Versus-One Classification

This approach constructs ${K \choose 2}$ SVMs, each of which compares a pair of classes. We classify a test observation using each of the ${K \choose 2}$ classifiers, tally the number of times the test observation is assigned to each of the $K$ classes, and assign it to the class to which it was assigned most frequently.
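A minimal one-versus-one sketch with scikit-learn (an assumed tool, not part of the text): `OneVsOneClassifier` fits the ${K \choose 2}$ pairwise SVMs and assigns each test point to the class that wins the most pairwise votes.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # K = 3 classes -> K choose 2 = 3 pairwise SVMs

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))              # 3 fitted pairwise classifiers
print(ovo.predict(X[:5]))                # class receiving the most votes for each point
```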

4.2 One-Versus-All Classification

We fit $K$ SVMs, each time comparing one of the $K$ classes (coded as $+1$) to the remaining $K-1$ classes (coded as $-1$). Let $x^{*}$ denote a test observation, and let $\beta_{0k},\beta_{1k},\ldots,\beta_{pk}$ denote the parameters from fitting the $k$th SVM. We assign the observation to the class for which $\beta_{0k}+\beta_{1k}x_1^*+\ldots+\beta_{pk}x_p^*$ is largest, as this amounts to a high level of confidence that the test observation belongs to the $k$th class.
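A corresponding one-versus-all sketch (again assuming scikit-learn): `OneVsRestClassifier` fits $K$ SVMs, and the predicted class is the one whose decision function value is largest at $x^*$.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # K = 3 classes -> 3 one-versus-rest SVMs

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
scores = ovr.decision_function(X[:5])    # one column per class: beta_0k + beta_k^T x*
print(scores.argmax(axis=1))             # index of the class with the largest value
print(ovr.predict(X[:5]))                # predict makes the same assignment
```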