This lab consists of implementing the Support Vector Machine (SVM) algorithm.
We are given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right), x^{(i)} \in \mathcal{X}, y^{(i)} \in \mathcal{Y}, i \in \{1, \dots, n \} \right\}$, where $\mathcal{Y} = \{ 1, \dots, k\}$. Recall (from lecture 7) that SVM aims at minimizing the following cost function $J$: $$ \begin{split} J(\theta_1, \theta_2, \dots, \theta_k) &= \sum_{i = 1}^n L_i \\ &= \sum_{i = 1}^n \sum_{j \neq y^{(i)}} \max(0, \theta_j^T x^{(i)} - \theta_{y^{(i)}}^T x^{(i)} + \Delta) \end{split} $$
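For instance, with $k = 2$ classes (as in the toy data below), each sample has a single incorrect class $j \neq y^{(i)}$, so its loss reduces to one hinge term, $ L_i = \max\left(0, \theta_j^T x^{(i)} - \theta_{y^{(i)}}^T x^{(i)} + \Delta\right)$, which is zero as soon as the correct class outscores the incorrect one by at least the margin $\Delta$.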
In [1]:
k_classes = 2
X = [[1., 1.5, 0.2], [1., 0.3, 1.2], [1., 1.6, 0.4], [1., 1.3, 0.25], [1., 0.5, 1.12]]
Y = [1, 2, 1, 1, 2]
Let us take a look at the data in 2D (we ignore the intercept term, which is always equal to 1).
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
X1 = [x[1] for x in X]
X2 = [x[2] for x in X]
plt.scatter(X1, X2, c=Y) # plot x1, x2, color is defined by the label y
plt.show()
The data was generated so that we have two quite distinct classes. This is usually not the case in reality, and for this reason we will see what happens when outliers are introduced (see homework below).
In [3]:
def score(x, theta):
    # Linear score theta^T x, computed on plain Python lists
    return sum(t * xi for t, xi in zip(theta, x))
Exercise: Define a function cost_function
that takes as parameters a sample (the actual label $y$ and the feature vector $x$), the $\theta_j$ for each class, as well as $\Delta$, and returns the value of the cost function for this sample.
Hint: Recall from lecture 7 that it is given by: $$ L_i = \sum_{j \neq y} \max(0, \theta_j^T x - \theta_{y}^T x + \Delta) $$
In [4]:
def cost_function(x, y, thetas, delta):
    return 0
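In case you want something to compare against, here is one possible sketch, not necessarily the intended solution. It assumes that the labels are the integers $1, \dots, k$ (as in the data above), that thetas is a 0-indexed list of parameter vectors, and it reuses the score helper; the _sketch suffix is only there to avoid clashing with your own cost_function.

def cost_function_sketch(x, y, thetas, delta):
    # Score of the correct class (labels are 1-based, thetas is 0-indexed)
    correct = score(x, thetas[y - 1])
    loss = 0.0
    for j in range(len(thetas)):
        if j == y - 1:
            continue  # the correct class contributes no hinge term
        loss += max(0.0, score(x, thetas[j]) - correct + delta)
    return loss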
Now that we are able to compute the loss for a single training sample, we can compute the total cost.
Exercise: Define a function cost_function_total
that will compute the total cost function given by
$$ J = \sum_{i = 1}^n L_i = \sum_{i = 1}^n \sum_{j \neq y^{(i)}} \max(0, \theta_j^T x^{(i)} - \theta_{y^{(i)}}^T x^{(i)} + \Delta) $$
In [5]:
def cost_function_total(X, Y, thetas, delta):
    return 0
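Under the same assumptions, a possible sketch of the total cost simply sums the hypothetical cost_function_sketch over the training set:

def cost_function_total_sketch(X, Y, thetas, delta):
    # Total cost J: sum of the per-sample hinge losses
    return sum(cost_function_sketch(x, y, thetas, delta) for x, y in zip(X, Y))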
Recall that the prediction on a feature vector $x$ is given by the value of $j$ that maximizes the score $\theta_j^T x$.
Exercise: Define a function predict
which takes a feature vector x
as well as the $\theta_j$s, and outputs the predicted class $\hat{y}$, which is the value of $j$ maximizing the score.
Hint: We have defined a score
function for this purpose.
In [6]:
def predict(x, thetas):
    return 0
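One possible sketch, again reusing score and assuming 1-based labels (in case of a tie, list.index returns the first maximizer):

def predict_sketch(x, thetas):
    # Score every class and return the (1-based) label with the highest score
    scores = [score(x, theta) for theta in thetas]
    return scores.index(max(scores)) + 1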
We have just defined everything we need to make the prediction and compute the loss for the SVM problem. As usual, we want to minimize this loss. Gradient descent works well in this case, and in order to apply it we need to compute the gradient.
Recall that we need to compute one gradient per class, since we have $k$ vectors $\theta_j$. For a sample $(x, y)$, the gradients of the loss $L$ are given by the following formulas: for every incorrect class $j \neq y$, $$ \nabla_{\theta_j} L = \mathbb{1}\left(\theta_j^T x - \theta_{y}^T x + \Delta > 0\right) \, x, $$ and for the correct class, $$ \nabla_{\theta_y} L = - \left( \sum_{j \neq y} \mathbb{1}\left(\theta_j^T x - \theta_{y}^T x + \Delta > 0\right) \right) x, $$ where $\mathbb{1}(\cdot)$ denotes the indicator function.
Exercise: Implement a function gradients
that takes as parameters a sample ($x$, $y$), the $\theta_j$s, as well as $\Delta$, and returns these gradients (one per class).
In [7]:
def gradients(x, y, thetas, delta):
    return 0
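One way to turn these formulas into code, sketched under the same assumptions as before (1-based labels, thetas and the returned gradients stored as plain lists of lists):

def gradients_sketch(x, y, thetas, delta):
    # One gradient vector per class, all initialized to zero
    grads = [[0.0] * len(x) for _ in thetas]
    correct = score(x, thetas[y - 1])
    for j in range(len(thetas)):
        if j == y - 1:
            continue
        # Indicator: does class j violate the margin for this sample?
        if score(x, thetas[j]) - correct + delta > 0:
            for d in range(len(x)):
                grads[j][d] += x[d]       # push theta_j away from x
                grads[y - 1][d] -= x[d]   # pull theta_y towards x
    return grads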
The last quantity needed in order to apply gradient descent is the total gradient (needed for batch gradient descent; in the stochastic gradient descent case, the gradient for a single sample is enough).
To compute it, we just need to sum the gradients for all the samples within the training set.
Exercise: Implement a function gradient_total
that takes as inputs a set of feature vectors $X$, a set of labels $Y$, values for the $\theta_j$s as well as the hyperparameter $\Delta$, and outputs the gradient of $J$.
In [8]:
def gradient_total(X, Y, thetas, delta):
    return 0
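A possible sketch accumulates the hypothetical gradients_sketch over the training set:

def gradient_total_sketch(X, Y, thetas, delta):
    # Sum the per-sample gradients over the whole training set
    total = [[0.0] * len(X[0]) for _ in thetas]
    for x, y in zip(X, Y):
        sample_grads = gradients_sketch(x, y, thetas, delta)
        for j in range(len(thetas)):
            for d in range(len(X[0])):
                total[j][d] += sample_grads[j][d]
    return total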
Now that we have the gradient, we can apply the gradient descent algorithm.
Exercise: Implement a function gradient_descent
that takes as parameters the training set $(X, Y)$, the hyperparameter $\Delta$ as well as a learning rate, and applies the gradient descent algorithm to the SVM case.
Hint: Feel free to get back to the lectures to recall the gradient descent update.
In [9]:
def gradient_descent(X, Y, delta, learning_rate):
    return 0
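For reference, a minimal sketch of batch gradient descent for this problem. The all-zero initialization, the default number of iterations, and the use of the global k_classes are illustration choices made here, not part of the exercise statement:

def gradient_descent_sketch(X, Y, delta, learning_rate, n_iterations=100):
    # Start from all-zero parameter vectors, one per class (k_classes is defined above)
    thetas = [[0.0] * len(X[0]) for _ in range(k_classes)]
    for _ in range(n_iterations):
        grads = gradient_total_sketch(X, Y, thetas, delta)
        for j in range(k_classes):
            for d in range(len(thetas[j])):
                # Batch gradient descent update on theta_j
                thetas[j][d] -= learning_rate * grads[j][d]
    return thetas

You can monitor the total cost inside the loop to check the behaviour of the algorithm and to tune the learning rate.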
At this point, you should be able to guess the questions for the homework and solve them. As in the OLS case, the questions are: