A Support Vector Machine (SVM) is a classifier that attempts to separate classes of data by mapping them to a space in which they can be divided by a hyperplane. The location of the hyperplane is selected based on a set of training examples, $\mathbf{x}_i \in \Re^{d}$ $(i=1,2,\dots,N)$, each labelled with a binary variable, $y_i \in \{-1,+1\}$. To query the label of a new point $\mathbf{x}$, the algorithm simply determines on which side of the hyperplane the point sits, $\text{sgn}\big[f(\mathbf{x})\big]$, where $f(\mathbf{x})$ is the discriminant function associated with the hyperplane
\begin{equation} f(\mathbf{x})=\mathbf{w}^{\top}\Phi(\mathbf{x})+b. \end{equation}Here, $\mathbf{w}$, $\Phi(\cdot)$ and $b$ denote the weight vector, the basis function and the bias term, respectively. In two dimensions, this amounts to dividing the input space into two regions with a straight line; in three dimensions, with a plane; and in general, with a hyperplane. Typically, the original data are projected into a higher-dimensional feature space, $\Re^{d^*}$ with $d^* > d$, using a basis function $\Phi(\mathbf{x})$, which produces more flexible decision boundaries in the lower-dimensional input space.
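As a concrete illustration, a minimal Python sketch of this decision rule follows; the function name and the default identity feature map are assumptions for illustration, not the implementation used in this paper.

```python
import numpy as np

def predict_label(x, w, b, phi=lambda v: v):
    """Classify a query point by the side of the hyperplane it falls on:
    sgn(w . phi(x) + b). phi defaults to the identity map (a linear SVM)."""
    return np.sign(w @ phi(x) + b)

# Example: a hyperplane in 2-D (a straight line through the origin)
w, b = np.array([1.0, -1.0]), 0.0
print(predict_label(np.array([2.0, 1.0]), w, b))  # 1.0: the positive side
```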
The weight vector $\mathbf{w}$ is optimised by minimising a cost function:
\begin{equation} \Psi(\mathbf{w},\boldsymbol{\xi})=\frac{1}{2}\|\mathbf{w}\|^2+C\sum^{N}_{i=1}\xi_i \end{equation}subject to the constraints:
\begin{equation} y_i(\mathbf{w}^{\top}\Phi(\mathbf{x}_i)+b) \geqslant 1-\xi_i \end{equation}and \begin{equation} \xi_i \geqslant 0, \; \; \; i=1,2,\dots,N. \end{equation}
Here, the $\xi_i$ act as slack variables, which are required when the data are not linearly separable. $C$ controls the penalty incurred by the model for misclassified points and points lying inside the margin.
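To make the roles of $C$ and the slack variables concrete, the constrained problem can be rewritten with $\xi_i = \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big)$, giving an unconstrained hinge-loss objective. Below is a minimal sketch of sub-gradient descent on this objective for a linear SVM ($\Phi$ the identity); the learning rate and epoch count are illustrative assumptions, not values from this paper.

```python
import numpy as np

def train_linear_svm(X, y, C=10.0, lr=1e-3, epochs=1000):
    """Sub-gradient descent on 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*(w.x_i + b)),
    the hinge-loss form of the soft-margin cost function."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                        # points with slack xi_i > 0
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

A large $C$ makes the violation term dominate, shrinking the margin to fit more training points; a small $C$ tolerates misclassifications in exchange for a wider margin.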
Finally, the weights $\mathbf{w}$ can be found from
\begin{equation} \mathbf{w} = \sum^{N}_{i=1}\alpha_i y_i \Phi(\mathbf{x}_i), \end{equation}where the $\alpha_i$ come from reformulating the cost function as a Lagrange function and finding the Lagrange multipliers via the dual optimisation:
\begin{align} \max_{\alpha} W(\alpha) = \sum^N_{i=1}\alpha_i - \frac{1}{2}\sum^{N}_{i=1}\sum^{N}_{j=1}\alpha_i\alpha_jy_iy_j\Phi(\mathbf{x}_i)^{\top}\Phi(\mathbf{x}_j) \end{align}subject to the constraints $0 \leqslant \alpha_i \leqslant C$ for all $i=1,2,\dots,N$ and $\sum^{N}_{i=1}\alpha_i y_i = 0$. The computational cost of determining the dot product $\Phi(\mathbf{x}_i)^{\top}\Phi(\mathbf{x}_j)$ can be large for large $d^*$. Fortunately, there exist kernel functions, $k(\mathbf{x}_i,\mathbf{x}_j)=\Phi(\mathbf{x}_i)^{\top}\Phi(\mathbf{x}_j)$, which return the dot product directly without the need to first calculate its factors.
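One common choice, consistent with the $\gamma$ hyperparameter tuned below (though the kernel is not named explicitly here), is the radial basis function (RBF) kernel,
\begin{equation} k(\mathbf{x}_i,\mathbf{x}_j)=\exp\big(-\gamma\|\mathbf{x}_i-\mathbf{x}_j\|^2\big), \end{equation}whose implicit feature space is infinite-dimensional, so the dot product could not be computed explicitly at all. In kernelised form the discriminant function is evaluated directly as $f(\mathbf{x})=\sum^{N}_{i=1}\alpha_i y_i k(\mathbf{x}_i,\mathbf{x})+b$, so $\mathbf{w}$ never needs to be formed explicitly.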
The SVM in this paper was trained using five-fold cross-validation. The key parameters, $C$ and $\gamma$ (a coefficient of the kernel function), were optimised to maximise the model's predictive power on observations that were held out during training. The final values employed for the results presented were $C = 10$ and $\gamma = 0.1$.
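A minimal sketch of this tuning procedure using scikit-learn follows; the library choice, candidate grids, synthetic dataset and RBF kernel are all illustrative assumptions, as the text states only the selected values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative stand-in for the paper's training data
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=0)

# Candidate values for C and gamma; five-fold CV scores each combination
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # the paper's selected values were C=10, gamma=0.1
```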
To handle the multi-class nature of the data, a one-versus-one approach was adopted. A separate SVM was trained for each pair of labels, resulting in $L(L-1)/2$ different classifiers, where $L$ is the number of labels. During prediction, each classifier was evaluated at the query point and the most frequently returned label was used as the model's output.
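The voting scheme is written out explicitly in the hypothetical helper below for clarity; scikit-learn's SVC applies the same one-versus-one strategy internally for multi-class data, so this sketch is illustrative rather than the paper's implementation.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def ovo_predict(X_train, y_train, x_query, C=10, gamma=0.1):
    """Train one SVM per pair of labels (L(L-1)/2 classifiers in total)
    and return the label receiving the most votes at the query point."""
    labels = np.unique(y_train)
    votes = []
    for a, b in combinations(labels, 2):
        mask = np.isin(y_train, (a, b))          # keep only the two classes
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        clf.fit(X_train[mask], y_train[mask])
        votes.append(clf.predict(x_query.reshape(1, -1))[0])
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]             # majority vote wins
```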