Support Vector Machines (SVMs) are a classification method first described in 1992 by Boser, Guyon, and Vapnik. They are widely used in Natural Language Processing, Bioinformatics, and other fields of Machine Learning due to their high accuracy and ability to handle high-dimensional data. Despite the name, there isn't anything 'Machine-y' about them. Perhaps a more informative name for what this classifier is about would have been 'Max Margin Classifier'.
Let's say we have a classification problem where each observation belongs to exactly one of two classes. A linear discriminative classifier tries to identify the "best" hyperplane or decision surface separating the observations of the two classes. Once we know how to specify the decision surface, we can determine which side of it a test observation (the one we're trying to label) falls on.
In the figure below, H1, H2 and H3 are all examples of such decision surfaces. This is a 2D example, so each decision surface is a line.
To understand how a Max Margin Classifier solves the problem of finding the 'best' hyperplane, we must first understand how to define a margin.
Let $x$ be the feature matrix, with one row per observation and one column per feature. For the training dataset $D$, the labels are given by the vector $y$.
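To be concrete, here is a minimal sketch (the values are made up purely for illustration) of what $x$ and $y$ look like as numpy arrays:

```python
import numpy as np

# 4 observations, 2 features: one row per observation, one column per feature.
x = np.array([[2.0, 1.0],
              [3.0, 2.5],
              [-1.0, -0.5],
              [-2.0, -1.5]])
# One label per observation; we'll see below why +/-1 is a convenient encoding.
y = np.array([1.0, 1.0, -1.0, -1.0])
```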
For a given weight vector $w$, the decision surface is defined by the linear discriminant function:
$$f(x) = w^Tx + b$$Where $b$ is a bias term that shifts the decision surface away from the origin.
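As a quick sanity check, here is a minimal sketch (with assumed values for $w$ and $b$, chosen only for illustration) of how the sign of the discriminant function assigns a class:

```python
import numpy as np

w = np.array([1.0, -1.0])  # assumed weight vector
b = 0.5                    # assumed bias term

def classify(x_obs):
    # The sign of f(x) = w^T x + b tells us which side of the
    # decision surface the observation falls on.
    return 1.0 if np.dot(w, x_obs) + b >= 0 else -1.0

print(classify(np.array([2.0, 1.0])))   # 1.0
print(classify(np.array([-2.0, 1.0])))  # -1.0
```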
For the moment, let's assume that our training data is linearly separable. Let $x_{+}$ and $x_{-}$ be the points in the two classes closest to the decision surface. Then the margin of f w.r.t. the feature dataset D is:
$$m_{D}(f) = \frac{1}{2} (\hat{w}^Tx_{+} - \hat{w}^Tx_{-})$$Where $\hat{w}$ is a unit vector, i.e. $\hat{w} = \frac {w} {||w||}$
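As a numeric sketch (with made-up points), the margin is half the distance between the projections of the two closest points onto the unit normal $\hat{w}$:

```python
import numpy as np

w = np.array([1.0, 1.0])
w_hat = w / np.linalg.norm(w)   # unit normal to the decision surface
x_pos = np.array([1.0, 1.0])    # assumed closest point in the positive class
x_neg = np.array([-1.0, -1.0])  # assumed closest point in the negative class

margin = 0.5 * (np.dot(w_hat, x_pos) - np.dot(w_hat, x_neg))
print(margin)  # sqrt(2) ~ 1.414
```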
Let's also assume that $x_{+}$ and $x_{-}$ are equidistant from $f$. So,
$$\begin{align} &f(x_{+}) = w^Tx_{+} + b = a \quad \text{(some constant)} \\ &f(x_{-}) = w^Tx_{-} + b = -a \end{align}$$To make the geometric margin meaningful, we fix the value of the decision function at the points closest to the hyperplane and set $a = 1$. Then the margin function becomes:
$$\begin{align} &m_{D}(f) = \frac{1}{2} (\hat{w}^Tx_{+} - \hat{w}^Tx_{-}) \\ & = \frac {1}{2||w||} (w^Tx_{+} - w^Tx_{-}) \\ & = \frac {2a}{2||w||} = \frac {1} {||w||} \\ \end{align}$$We can now frame the problem of finding the decision surface as a constrained optimization problem:
Maximize $\frac {1} {||w||}$ such that for each $x_i$,

$$\begin{align} &w^Tx_i + b \geq 1 \ \text{if } x_i \text{ belongs to the positive class} \\ &w^Tx_i + b \leq -1 \ \text{if } x_i \text{ belongs to the negative class} \end{align}$$We can combine the two constraints into a single equivalent constraint by making the class labels +1 and -1, in which case the constraint becomes

$y_i(w^Tx_i + b) \geq 1$.
Maximizing $\frac {1} {||w||}$ is the same as minimizing $||w||^2$. So the optimization problem now becomes (noting that $||w||^2 = w^Tw$):
Minimize $w^Tw$ such that for each $x_i$, $y_i(w^Tx_i + b) \geq 1$.
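This is a standard quadratic program, so we can sanity-check the formulation with a generic solver before building anything by hand. Here is a minimal sketch using scipy.optimize.minimize on made-up, linearly separable data (a dedicated QP solver would be the usual choice in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: 2 features, labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]       # params = [w1, w2, b]
    return np.dot(w, w)  # minimize w^T w

# One inequality constraint per observation: y_i*(w^T x_i + b) - 1 >= 0.
constraints = [
    {'type': 'ineq',
     'fun': lambda p, i=i: y[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0}
    for i in range(len(y))
]

result = minimize(objective, x0=np.zeros(3), constraints=constraints,
                  method='SLSQP')
w_hat, b_hat = result.x[:2], result.x[2]
print(w_hat, b_hat)
```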
So far we've assumed that the data for the two classes are linearly separable, but that's not always true in real life. When the classes overlap, allowing the classifier to misclassify some observations can lead to a bigger margin.
To make room for misclassification of some points, we make the margin "soft" by introducing a slack (or error, or loss) variable $\xi_i(w, x_i, y_i)$. The inequality constraint in the optimization problem now becomes $y_i(w^Tx_i + b) \geq 1 - \xi_i$, with $\xi_i \geq 0$. In addition, we also want to control the amount of slack, so we add the sum of all errors to the quantity we want to minimize. The modified optimization problem now is:
Minimize $w^Tw + C\sum_i \xi_i$ such that for each $x_i$, $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$.
The constant $C$ is a regularization constant. Higher values of $C$ imply lower slack tolerance, and vice versa.
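To see the effect of $C$ without writing any optimization code yet, here is a minimal sketch using scikit-learn's LinearSVC (assuming scikit-learn is available; it's used here only to illustrate the regularization knob, not as part of the implementation below):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2.0, rng.randn(20, 2) - 2.0])
y = np.array([1.0] * 20 + [-1.0] * 20)

for C in (0.01, 1.0, 1000.0):
    clf = LinearSVC(C=C, loss='hinge').fit(X, y)
    # Smaller C tolerates more slack, which typically yields a smaller ||w||
    # (a wider margin); larger C pushes toward separating every observation.
    print(C, np.linalg.norm(clf.coef_))
```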
For a linear SVM, two kinds of loss functions are commonly used:
$$\begin{align} &\xi_i = \max(0,\ 1 - y_i(w^Tx_i + b)) &\text{(L1 hinge loss)} \\ &\xi_i = \max(0,\ 1 - y_i(w^Tx_i + b))^2 &\text{(L2 hinge loss)} \end{align}$$
Formulating the loss function this way makes the inequality constraint redundant, since the loss is zero exactly when the constraint is satisfied. So the final formulation (for the L1 hinge loss) is the unconstrained problem:

Minimize $w^Tw + C\sum_i \max(0,\ 1 - y_i(w^Tx_i + b))$.
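As a minimal numpy sketch (with made-up data), the unconstrained L1 objective is straightforward to evaluate:

```python
import numpy as np

def l1_svm_objective(w, b, X, y, C=1.0):
    # w^T w + C * sum_i max(0, 1 - y_i*(w^T x_i + b))
    margins = y * (X.dot(w) + b)
    return np.dot(w, w) + C * np.maximum(0.0, 1.0 - margins).sum()

X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
print(l1_svm_objective(np.array([0.5, 0.5]), 0.0, X, y))  # 0.5
```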
Sometimes we also like to consider the Lagrangian Dual of the optimization problem above:
Minimize $f(\alpha) = \frac {1}{2}\alpha^TQ\alpha - \sum_i \alpha_i$, subject to the constraint $0 \leq \alpha_i \leq U$ for each $i$, where $Q_{ij} = y_iy_jx_i^Tx_j$ and $U = C$ for the L1 SVM.

Reversing the signs of the two terms in $f(\alpha)$, this is equivalent to:

Maximize $f(\alpha) = \sum_i \alpha_i - \frac {1}{2}\alpha^TQ\alpha$, subject to the constraint $0 \leq \alpha_i \leq U$.

Once we have the optimal $\alpha$, the weights are recovered as $w = \sum_i \alpha_iy_ix_i$.
We'll now build an L1 SVM for two classes ('setosa' and 'versicolor') from the well-known IRIS dataset. For optimization, we choose to solve the dual problem using the Coordinate Descent method described in this paper.
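The per-coordinate update implemented below is simple: sweep over the coordinates $\alpha_i$, compute the gradient $G$ of the dual objective along that coordinate, and move $\alpha_i$ against it, clipping the result to the box $[0, U]$, while keeping $w$ in sync incrementally:
$$\begin{align} &G = y_iw^Tx_i - 1 \\ &\alpha_i^{new} = \min(\max(\alpha_i^{old} - G/Q_{ii},\ 0),\ U) \\ &w \leftarrow w + (\alpha_i^{new} - \alpha_i^{old})\ y_ix_i \end{align}$$The projected gradient $PG$ in the code is used only to skip coordinates that are already optimal at a boundary of the box.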
In [1]:
# Imports and data fetch
%matplotlib nbagg
import numpy as np
import pandas as pd
import seaborn as sns
iris_df = sns.load_dataset('iris')
iris_df = iris_df.set_index('species')
iris_df.head(2)
Out[1]:
In [2]:
iris_df.index.unique()
Out[2]:
In [3]:
USE_LABELS = ['setosa', 'versicolor']
def label2val(label_or_val, inv=False):
    # Converts between the class labels and their numerical values.
    if inv:
        if label_or_val >= 1.0:
            return 'setosa'
        elif label_or_val <= -1.0:
            return 'versicolor'
        else:
            print('WARN: Unable to classify w^Tx = [%s]' % label_or_val)
            return 'setosa' if label_or_val >= 0 else 'versicolor'
    else:
        if label_or_val == 'setosa':
            return 1.0
        else:
            return -1.0
iris_2C = iris_df.loc[USE_LABELS, :]
sns.pairplot(iris_2C.reset_index(), hue='species')
Out[3]:
In [4]:
# Set aside 10 observations from each class for testing.
def split_train_test(df, label, n=10):
    label_df = df.loc[label, :]
    # Sample n distinct row positions; randint can repeat positions, so use
    # choice without replacement to guarantee exactly n test observations.
    test_idx = np.random.choice(len(label_df), size=n, replace=False)
    train_idx = sorted(set(range(len(label_df))) - set(test_idx))
    # .ix is deprecated; use positional indexing with .iloc instead.
    return label_df.iloc[train_idx, :], label_df.iloc[list(test_idx), :]
setosa_train, setosa_test = split_train_test(iris_2C, 'setosa')
versicolor_train, versicolor_test = split_train_test(iris_2C, 'versicolor')
total_train = pd.concat([setosa_train, versicolor_train], axis=0)
total_test = pd.concat([setosa_test, versicolor_test], axis=0)
In [5]:
# Code the linear SVM
def linear_svm(X, Y, n_iter=2000, C=1000):
    # X is 2D: n_samples * n_features
    # Y is 1D: n_samples
    U = C  # upper bound on alpha_i; U = C for the L1 SVM
    n_samples, n_features = X.shape
    assert n_samples == Y.shape[0]
    # Initialize alpha to zero, which implies w = sum_i alpha_i*y_i*x_i = 0.
    alpha, w = np.zeros(n_samples), np.zeros(n_features)
    for iter_idx in range(n_iter):
        if iter_idx == 0 or (iter_idx + 1) % 200 == 0:
            print('Weights before iteration [%s]: %s' % (iter_idx + 1, w))
        for i in range(n_samples):
            # Gradient of the dual objective along alpha_i (D_ii = 0 for L1 SVM).
            G = Y[i] * np.dot(w, X[i]) - 1.0
            # Projected gradient: skip coordinates already optimal at a boundary.
            if alpha[i] == 0:
                PG = min(G, 0)
            elif alpha[i] == U:
                PG = max(G, 0)
            else:
                PG = G
            if PG != 0:
                alpha_i_old = alpha[i]
                Q_ii = Y[i] * Y[i] * np.dot(X[i], X[i])
                # Move alpha_i against the gradient, clipping to [0, U].
                alpha[i] = min(max(alpha[i] - G / Q_ii, 0), U)
                # Keep w = sum_i alpha_i*y_i*x_i in sync incrementally.
                w += (alpha[i] - alpha_i_old) * Y[i] * X[i]
    return w

x = total_train.to_numpy()
y = np.array(total_train.index.map(label2val))
w_opt = linear_svm(x, y)
In [6]:
# Evaluate the test set
def classify_row(w, row):
    # .as_matrix() is deprecated; .to_numpy() returns the row as an ndarray.
    prod = np.dot(w, row.to_numpy())
    return label2val(prod, inv=True)
test_results = total_test.apply(lambda row: classify_row(w_opt, row), axis=1)
test_results.name = 'Actual Label'
test_results.index.name = 'Expected Label'
test_results.reset_index()
Out[6]:
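To summarize the table above as a single number, a quick sketch computing the fraction of correctly classified test observations:

```python
accuracy = (test_results.values == test_results.index.values).mean()
print('Test accuracy: %.2f' % accuracy)
```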