Chapter 09
Support Vector Machines
In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$.
Mathematically, a hyperplane in $p$-dimensional space is defined by the equation $$ \beta_0+\beta_1X_1+\beta_2X_2+\ldots+\beta_pX_p=0 $$
Suppose that we have an $n \times p$ data matrix $X$ that consists of $n$ training observations in $p$-dimensional space, and that these observations fall into two classes; that is, $y_1,\ldots,y_n \in \{-1,1\}$. Then a separating hyperplane has the property that $$ (\beta_0+\beta_1x_{i1} +\beta_2x_{i2}+\ldots+\beta_px_{ip})y_i > 0 $$ for all $i=1,\ldots,n$.
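As a minimal sketch with made-up data (both the points and the candidate coefficients below are illustrative, not fitted), this checks the separating-hyperplane property above:

```python
import numpy as np

# Made-up, linearly separable data in two classes coded as +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

beta0 = 0.0
beta = np.array([1.0, 1.0])        # a candidate hyperplane beta_0 + beta^T x = 0

# y_i * (beta_0 + beta_1 x_i1 + ... + beta_p x_ip) for every observation.
margins = y * (beta0 + X @ beta)
print(margins)                     # [4.  6.  2.5 3. ]
print(bool(np.all(margins > 0)))   # True: this hyperplane separates the classes
```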
A natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations. Briefly, the maximal margin hyperplane is the solution to the optimization problem $$ \underset{\beta_0,\beta_1,\ldots,\beta_p}{\text{maximize}}\;M $$ subject to $$ \sum_{j=1}^{p}\beta_j^2=1 $$ $$ (\beta_0+\beta_1x_{i1} +\beta_2x_{i2}+\ldots+\beta_px_{ip})y_i \ge M \quad \text{for all } i=1,\ldots,n $$ The norm constraint $\sum_{j=1}^{p}\beta_j^2=1$ ensures that $(\beta_0+\beta_1x_{i1}+\ldots+\beta_px_{ip})y_i$ equals the perpendicular distance from the $i$th observation to the hyperplane, so every observation lies at least a distance $M$ from the hyperplane and $M$ is the width of the margin.
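A sketch of fitting an (approximate) maximal margin classifier, assuming scikit-learn is available: `SVC`'s `C` parameter penalizes margin violations, so a very large `C` leaves essentially no slack and, on separable data, approximates the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

# Same made-up separable data as above.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Very large C ~ no tolerance for violations ~ maximal margin classifier.
clf = SVC(kernel="linear", C=1e10).fit(X, y)
print(clf.coef_, clf.intercept_)   # the fitted beta_1..beta_p and beta_0
print(clf.support_vectors_)        # the observations that determine the margin
```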
The support vector classifier relaxes this requirement by allowing some observations to violate the margin. It is the solution to the optimization problem $$ \underset{\beta_0,\beta_1,\ldots,\beta_p,\epsilon_1,\ldots,\epsilon_n}{\text{maximize}}\;M $$ subject to $$\sum_{j=1}^{p}\beta_j^2=1$$ $$(\beta_0+\beta_1x_{i1} +\beta_2x_{i2}+\ldots+\beta_px_{ip})y_i \ge M(1-\epsilon_i)$$ $$\epsilon_i \ge 0, \quad \sum_{i=1}^n\epsilon_i \le C$$ where $C$ is a nonnegative tuning parameter and the $\epsilon_i$ are slack variables that allow individual observations to fall on the wrong side of the margin or of the hyperplane.
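A sketch of how the budget behaves in practice, again assuming scikit-learn (the random data below are illustrative). One caution: scikit-learn's `C` multiplies the penalty on the slack variables, so it acts roughly as the *inverse* of the budget $C$ above; a small scikit-learn `C` corresponds to a large budget, a wide margin, and many support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds, so some slack is unavoidable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.2, size=(30, 2)),
               rng.normal(-1.0, 1.2, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

# Smaller sklearn C (larger budget) -> wider margin -> more support vectors.
for c in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(f"C={c:>6}: {clf.support_.size} support vectors")
```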
The support vector machine is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. The solution to the support vector classifier problem involves only the inner products of the observations. The inner product of two $r$-vectors $a$ and $b$ is defined as $\langle a,b\rangle=\sum_{i=1}^ra_ib_i$. Thus the inner product of two observations $x_i, x_{i'}$ is given by $$ \langle x_i,x_{i'}\rangle=\sum_{j=1}^{p}x_{ij}x_{i'j} $$
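The inner product above is the linear kernel; replacing it with a nonlinear kernel yields the support vector machine. A sketch of two common kernel choices (the values of $d$ and $\gamma$ below are illustrative, not prescribed):

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b                                   # <a, b>

def polynomial_kernel(a, b, d=3):
    return (1.0 + a @ b) ** d                      # (1 + <a, b>)^d

def radial_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))   # exp(-gamma * ||a - b||^2)

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), radial_kernel(a, b))
```

scikit-learn exposes the same choices directly, e.g. `SVC(kernel="poly", degree=3)` and `SVC(kernel="rbf", gamma=0.5)`.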
The one-versus-one approach constructs ${K \choose 2}$ SVMs, each of which compares a pair of classes. We classify a test observation using each of the ${K \choose 2}$ classifiers, tally the number of times the test observation is assigned to each of the $K$ classes, and assign it to the class to which it was most frequently assigned.
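A sketch of one-versus-one in scikit-learn, which trains the ${K \choose 2}$ pairwise classifiers inside `SVC` (the iris data, with $K=3$, stands in for an arbitrary multiclass problem):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

# SVC trains K-choose-2 = 3 pairwise classifiers and predicts by voting.
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(clf.decision_function(X[:1]).shape)  # (1, 3): one score per class pair
print(clf.predict(X[:1]))                  # majority vote over the 3 pairs
```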
In the one-versus-all approach, we fit $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K-1$ classes. Let $\beta_{0k},\beta_{1k},\ldots,\beta_{pk}$ denote the parameters that result from fitting an SVM comparing the $k$th class (coded as $+1$) to the others (coded as $-1$), and let $x^{*}$ denote a test observation. We assign the observation to the class for which $\beta_{0k}+\beta_{1k}x_1^*+\ldots+\beta_{pk}x_p^*$ is largest, as this amounts to a high level of confidence that the test observation belongs to the $k$th class.
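A sketch of one-versus-all using scikit-learn's `OneVsRestClassifier` (again on iris; `LinearSVC` and `max_iter=10000` are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One binary classifier per class: class k (+1) versus the rest (-1).
ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

scores = ova.decision_function(X[:1])   # beta_0k + beta_k^T x* for each k
print(scores, scores.argmax())          # predicted class = largest score
print(ova.predict(X[:1]))
```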