This lab consists of implementing the Support Vector Machine (SVM) algorithm.
We are given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right), x^{(i)} \in \mathcal{X}, y^{(i)} \in \mathcal{Y}, i \in \{1, \dots, n \} \right\}$, where $\mathcal{Y} = \{ 1, \dots, k\}$. Recall (from lecture 7) that SVM aims at minimizing the following cost function $J$: $$ \begin{split} J(\theta_1, \theta_2, \dots, \theta_k) &= \sum_{i = 1}^n L_i \\ &= \sum_{i = 1}^n \sum_{j \neq y^{(i)}} \max(0, \theta_j^T x^{(i)} - \theta_{y^{(i)}}^T x^{(i)} + \Delta) \end{split} $$
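For instance, with $k = 2$ classes (as in the toy data below), each sample has a single incorrect class $j \neq y^{(i)}$, so its loss reduces to one hinge term, $ L_i = \max\left(0, \theta_j^T x^{(i)} - \theta_{y^{(i)}}^T x^{(i)} + \Delta\right)$, which is zero as soon as the correct class outscores the incorrect one by at least the margin $\Delta$.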
In [1]:
k_classes = 2
X = [[1., 1.5, 0.2], [1., 0.3, 1.2], [1., 1.6, 0.4], [1., 1.3, 0.25], [1., 0.5, 1.12]]
Y = [1, 2, 1, 1, 2]
Let us take a look at the data in 2D (we ignore the intercept term, which is always equal to 1).
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
X1 = [x[1] for x in X]
X2 = [x[2] for x in X]
plt.scatter(X1, X2, c=Y) # plot x1, x2, color is defined by the label y
plt.show()
The data was generated so that we have two quite distinct classes. This is usually not the case in reality, and for this reason we will see what happens when outliers are introduced (see homework below).
In [3]:
def score(x, theta):
    # Linear score theta^T x, computed on plain Python lists
    return sum(t * xi for t, xi in zip(theta, x))
Exercise: Define a function cost_function
that takes as parameters a sample (the actual label $y$ and the feature vector $x$), the $\theta_j$ for each class, as well as $\Delta$, and returns the value of the cost function for this sample.
Hint: Recall from lecture 7 that it is given by: $$ L_i = \sum_{j \neq y} \max(0, \theta_j^T x - \theta_{y}^T x + \Delta) $$
In [4]:
def cost_function(x, y, thetas, delta):
    return 0
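In case you want something to compare against, here is one possible sketch, not necessarily the intended solution. It assumes that the labels are the integers $1, \dots, k$ (as in the data above), that thetas is a 0-indexed list of parameter vectors, and it reuses the score helper; the _sketch suffix is only there to avoid clashing with your own cost_function.

def cost_function_sketch(x, y, thetas, delta):
    # Score of the correct class (labels are 1-based, thetas is 0-indexed)
    correct = score(x, thetas[y - 1])
    loss = 0.0
    for j in range(len(thetas)):
        if j == y - 1:
            continue  # the correct class contributes no hinge term
        loss += max(0.0, score(x, thetas[j]) - correct + delta)
    return loss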
Now that we are able to compute the loss for a single training sample, we can compute the total cost.
Exercise: Define a function cost_function_total
that will compute the total cost function given by
$$ J = \sum_{i = 1}^n L_i = \sum_{i = 1}^n \sum_{j \neq y^{(i)}} \max(0, \theta_j^T x^{(i)} - \theta_{y^{(i)}}^T x^{(i)} + \Delta) $$
In [5]:
def cost_function_total(X, Y, thetas, delta):
    return 0
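Under the same assumptions, a possible sketch of the total cost simply sums the hypothetical cost_function_sketch over the training set:

def cost_function_total_sketch(X, Y, thetas, delta):
    # Total cost J: sum of the per-sample hinge losses
    return sum(cost_function_sketch(x, y, thetas, delta) for x, y in zip(X, Y))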
Recall that the prediction on a feature vector $x$ is given by the value of $j$ that maximizes the score $\theta_j^T x$.
Exercise: Define a function predict
which takes a feature vector x
as well as the $\theta_j$s, and outputs the predicted class $\hat{y}$, which is the value of $j$ maximizing the score.
Hint: We have defined a score
function for this purpose.
In [6]:
def predict(x, thetas):
    return 0
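One possible sketch, again reusing score and assuming 1-based labels (in case of a tie, list.index returns the first maximizer):

def predict_sketch(x, thetas):
    # Score every class and return the (1-based) label with the highest score
    scores = [score(x, theta) for theta in thetas]
    return scores.index(max(scores)) + 1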
We have just defined everything we need to make the prediction and compute the loss for the SVM problem. As usual, we want to minimize this loss. Gradient descent works well in this case, and in order to apply it we need to compute the gradient.
Recall that we need to compute one gradient per class, since we have $k$ vectors $\theta_j$. For a sample $(x, y)$, the gradients of the loss $L$ are given by the following formulas: for every incorrect class $j \neq y$, $$ \nabla_{\theta_j} L = \mathbb{1}\left(\theta_j^T x - \theta_{y}^T x + \Delta > 0\right) \, x, $$ and for the correct class, $$ \nabla_{\theta_y} L = - \left( \sum_{j \neq y} \mathbb{1}\left(\theta_j^T x - \theta_{y}^T x + \Delta > 0\right) \right) x, $$ where $\mathbb{1}(\cdot)$ denotes the indicator function.
Exercise: Implement a function gradients
that takes as parameters a sample ($x$, $y$), the $\theta_j$s, as well as $\Delta$, and returns these gradients (one per class).
In [7]:
def gradients(x, y, thetas, delta):
    return 0
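One way to turn these formulas into code, sketched under the same assumptions as before (1-based labels, thetas and the returned gradients stored as plain lists of lists):

def gradients_sketch(x, y, thetas, delta):
    # One gradient vector per class, all initialized to zero
    grads = [[0.0] * len(x) for _ in thetas]
    correct = score(x, thetas[y - 1])
    for j in range(len(thetas)):
        if j == y - 1:
            continue
        # Indicator: does class j violate the margin for this sample?
        if score(x, thetas[j]) - correct + delta > 0:
            for d in range(len(x)):
                grads[j][d] += x[d]       # push theta_j away from x
                grads[y - 1][d] -= x[d]   # pull theta_y towards x
    return grads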
The last quantity needed in order to apply gradient descent is the total gradient (needed for batch gradient descent; in the stochastic gradient descent case, the gradient for a single sample is enough).
To compute it, we just need to sum the gradients for all the samples within the training set.
Exercise: Implement a function gradient_total
that takes as inputs a set of feature vectors $X$, a set of labels $Y$, values for the $\theta_j$s as well as the hyperparameter $\Delta$, and outputs the gradient of $J$.
In [8]:
def gradient_total(X, Y, thetas, delta):
    return 0
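A possible sketch accumulates the hypothetical gradients_sketch over the training set:

def gradient_total_sketch(X, Y, thetas, delta):
    # Sum the per-sample gradients over the whole training set
    total = [[0.0] * len(X[0]) for _ in thetas]
    for x, y in zip(X, Y):
        sample_grads = gradients_sketch(x, y, thetas, delta)
        for j in range(len(thetas)):
            for d in range(len(X[0])):
                total[j][d] += sample_grads[j][d]
    return total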
Now that we have the gradient, we can apply the gradient descent algorithm.
Exercise: Implement a function gradient_descent
that takes as parameters the training set $(X, Y)$, the hyperparameter $\Delta$ as well as a learning rate, and applies the gradient descent algorithm to the SVM case.
Hint: Feel free to get back to the lectures to recall the gradient descent update.
In [9]:
def gradient_descent(X, Y, delta, learning_rate):
    return 0
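For reference, a minimal sketch of batch gradient descent for this problem. The all-zero initialization, the default number of iterations, and the use of the global k_classes are illustration choices made here, not part of the exercise statement:

def gradient_descent_sketch(X, Y, delta, learning_rate, n_iterations=100):
    # Start from all-zero parameter vectors, one per class (k_classes is defined above)
    thetas = [[0.0] * len(X[0]) for _ in range(k_classes)]
    for _ in range(n_iterations):
        grads = gradient_total_sketch(X, Y, thetas, delta)
        for j in range(k_classes):
            for d in range(len(thetas[j])):
                # Batch gradient descent update on theta_j
                thetas[j][d] -= learning_rate * grads[j][d]
    return thetas

You can monitor the total cost inside the loop to check the behaviour of the algorithm and to tune the learning rate.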
At this point, you should be able to guess the questions for the homework and solve them. As in the OLS case, the questions are: