Support Vector Machines

Notebook version: 1.0 (Oct 26, 2015)

Author: Jesús Cid Sueiro (jcid@tsc.uc3m.es)

Changes: v.1.0 - First version

In [2]:
# To visualize plots in the notebook
%matplotlib inline

# Imported libraries (used in the cells below)
import csv
import random
import matplotlib.pyplot as plt

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn import svm

1. Introduction

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where the number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

  • If the number of features is much greater than the number of samples, the method is likely to give poor performance.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

As a practical note, the support vector machines in scikit-learn accept both dense (numpy.ndarray, or anything convertible to it by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions on sparse data, it must have been fitted on such data. For optimal performance, use a C-ordered numpy.ndarray (dense) or a scipy.sparse.csr_matrix (sparse) with dtype=float64.
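
The second of these points can be illustrated with a small sketch (not part of the original notebook; the toy data and the names clf_prob and my_linear_kernel are introduced here only for illustration). Setting probability=True when creating the SVC triggers the internal cross-validated calibration that exposes predict_proba, and a custom kernel can be passed as a function returning the Gram matrix of its inputs.

# Minimal sketch: probability estimates and a custom kernel with scikit-learn SVMs.
import numpy as np
from sklearn import svm

np.random.seed(0)
X = np.vstack((np.random.randn(20, 2), np.random.randn(20, 2) + 2.))
y = np.array([0] * 20 + [1] * 20)

# Probability estimates require probability=True (expensive internal cross-validation).
clf_prob = svm.SVC(probability=True)
clf_prob.fit(X, y)
print clf_prob.predict_proba([[1., 1.]])

# A custom kernel is simply a function returning the Gram matrix of its two arguments.
def my_linear_kernel(A, B):
    return np.dot(A, B.T)

clf_custom = svm.SVC(kernel=my_linear_kernel)
clf_custom.fit(X, y)
print clf_custom.predict([[1., 1.]])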

2. SVM implementations in scikit-learn

SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.

SVC and NuSVC are similar methods, but they accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). On the other hand, LinearSVC is another implementation of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept the keyword kernel, as the kernel is assumed to be linear. It also lacks some of the members of SVC and NuSVC, like support_. As with other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers) of size [n_samples]:
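
As a side illustration (not from the original notebook; lin_clf is just an example name), the LinearSVC interface mentioned above looks as follows. Note that it takes no kernel keyword and exposes coef_ and intercept_, but no support_ attribute:

# Minimal sketch of the LinearSVC interface.
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

lin_clf = svm.LinearSVC()   # no 'kernel' keyword: the kernel is always linear
lin_clf.fit(X, y)
print lin_clf.predict([[2., 2.]])
print lin_clf.coef_, lin_clf.intercept_   # parameters of the linear decision function
# lin_clf.support_ would raise an AttributeError: LinearSVC keeps no support vectors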


In [3]:
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)


Out[3]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

After being fitted, the model can then be used to predict new values:


In [4]:
clf.predict([[2., 2.]])


Out[4]:
array([1])

The SVM decision function depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the members support_vectors_, support_ and n_support_:


In [7]:
# get support vectors
print clf.support_vectors_
# get indices of support vectors
print clf.support_ 
# get number of support vectors for each class
print clf.n_support_


[[ 0.  0.]
 [ 1.  1.]]
[0 1]
[1 1]

2.1. Example: the Iris dataset

As an illustration, consider the Iris dataset, taken from the UCI Machine Learning Repository. This data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant (setosa, versicolor or virginica). Each instance contains 4 measurements of a given flower: sepal length, sepal width, petal length and petal width, all in centimeters.

We will try to fit a logistic regression model to discriminate between two of the classes using only two attributes.

First, we load the dataset and split it into training and test subsets.


In [5]:
# Adapted from a notebook by Jason Brownlee
def loadDataset(filename, split):
    xTrain = []
    cTrain = []
    xTest = []
    cTest = []

    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
    for i in range(len(dataset)-1):
        for y in range(4):
            dataset[i][y] = float(dataset[i][y])
        item = dataset[i]
        if random.random() < split:
            xTrain.append(item[0:4])
            cTrain.append(item[4])
        else:
            xTest.append(item[0:4])
            cTest.append(item[4])
    return xTrain, cTrain, xTest, cTest

xTrain_all, cTrain_all, xTest_all, cTest_all = loadDataset('iris.data', 0.66)
nTrain_all = len(xTrain_all)
nTest_all = len(xTest_all)
print 'Train: ' + str(nTrain_all)
print 'Test: ' + str(nTest_all)


Train: 93
Test: 57
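
As a side note (not part of the original notebook), the same dataset can also be loaded directly from scikit-learn, which avoids parsing the CSV file by hand:

# Alternative sketch: load the Iris data directly from scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print iris.data.shape        # (150, 4): four measurements per flower
print iris.target_names      # ['setosa' 'versicolor' 'virginica']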

Now, we select two classes and two attributes.


In [6]:
# Select attributes
i = 0 # Try 0,1,2,3
j = 1 # Try 0,1,2,3 with j!=i

# Select two classes
c0 = 'Iris-versicolor' 
c1 = 'Iris-virginica'

# Select two coordinates
ind = [i, j]

# Take training set
X_tr = np.array([[xTrain_all[n][k] for k in ind] for n in range(nTrain_all) 
                  if cTrain_all[n]==c0 or cTrain_all[n]==c1])
C_tr = [cTrain_all[n] for n in range(nTrain_all) 
          if cTrain_all[n]==c0 or cTrain_all[n]==c1]
Y_tr = np.array([int(c==c1) for c in C_tr])
n_tr = len(X_tr)

# Take test set
X_tst = np.array([[xTest_all[n][k] for k in ind] for n in range(nTest_all) 
                 if cTest_all[n]==c0 or cTest_all[n]==c1])
C_tst = [cTest_all[n] for n in range(nTest_all) 
         if cTest_all[n]==c0 or cTest_all[n]==c1]
Y_tst = np.array([int(c==c1) for c in C_tst])
n_tst = len(X_tst)

2.2. Data normalization

Data normalization is a common preprocessing step in many machine learning algorithms. Its goal is to obtain a dataset where all input coordinates have a similar scale. Learning algorithms usually show fewer instabilities and convergence problems when data are normalized.

We will define a normalization function that returns a training data matrix with zero sample mean and unit sample variance.


In [7]:
def normalize(X, mx=None, sx=None):
    
    # Compute means and standard deviations
    if mx is None:
        mx = np.mean(X, axis=0)
    if sx is None:
        sx = np.std(X, axis=0)

    # Normalize
    X0 = (X-mx)/sx

    return X0, mx, sx

Now we can normalize the training and test data. Observe in the code that the same transformation must be applied to both sets: this is why the test data are normalized using the means and standard deviations computed on the training set.


In [8]:
# Normalize data
Xn_tr, mx, sx = normalize(X_tr)
Xn_tst, mx, sx = normalize(X_tst, mx, sx)
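
The same effect can be obtained with scikit-learn's StandardScaler; a minimal sketch (the variables Xn_tr2 and Xn_tst2 are not used in the rest of the notebook):

# Minimal sketch: the same normalization using sklearn.preprocessing.StandardScaler.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xn_tr2 = scaler.fit_transform(X_tr)    # fit on the training set only
Xn_tst2 = scaler.transform(X_tst)      # reuse the training means and standard deviations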

The following code generates a scatter plot of the normalized training data.


In [9]:
# Separate components of x into different arrays (just for the plots)
x0c0 = [Xn_tr[n][0] for n in range(n_tr) if Y_tr[n]==0]
x1c0 = [Xn_tr[n][1] for n in range(n_tr) if Y_tr[n]==0]
x0c1 = [Xn_tr[n][0] for n in range(n_tr) if Y_tr[n]==1]
x1c1 = [Xn_tr[n][1] for n in range(n_tr) if Y_tr[n]==1]

# Scatterplot.
labels = {'Iris-setosa': 'Setosa', 
          'Iris-versicolor': 'Versicolor',
          'Iris-virginica': 'Virginica'}
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')
plt.axis('equal')


Out[9]:
(-3.0, 3.0, -3.0, 3.0)

In order to apply the gradient descent rule, we need to define two methods:

  • A fit method that receives the training data and returns the model weights and the value of the negative log-likelihood at every iteration.
  • A predict method that receives the model weights and a set of inputs, and returns the posterior class probabilities for those inputs, as well as the corresponding class predictions.
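
The fit and predict methods below rely on a logistic (sigmoid) function that is not defined elsewhere in this notebook; a minimal definition consistent with how it is used here is:

# Logistic (sigmoid) function, required by logregFit and logregPredict below.
def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))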

In [10]:
def logregFit(Z_tr, Y_tr, rho, n_it):

    # Data dimension
    n_dim = Z_tr.shape[1]

    # Initialize variables
    nll_tr = np.zeros(n_it)
    pe_tr = np.zeros(n_it)
    w = np.random.randn(n_dim,1)

    # Running the gradient descent algorithm
    for n in range(n_it):
        
        # Compute posterior probabilities for weight w
        p1_tr = logistic(np.dot(Z_tr, w))
        p0_tr = logistic(-np.dot(Z_tr, w))

        # Compute negative log-likelihood
        nll_tr[n] = - np.dot(Y_tr.T, np.log(p1_tr)) - np.dot((1-Y_tr).T, np.log(p0_tr))

        # Update weights
        w += rho*np.dot(Z_tr.T, Y_tr - p1_tr)

    return w, nll_tr

def logregPredict(Z, w):

    # Compute posterior probability of class 1 for weights w.
    p = logistic(np.dot(Z, w))
    
    # Class
    D = [int(round(pn)) for pn in p]
    
    return p, D

We can test the behavior of the gradient descent method by fitting a logistic regression model with ${\bf z}({\bf x}) = (1, {\bf x}^\intercal)^\intercal$.


In [11]:
# Parameters of the algorithms
rho = float(1)/50    # Learning step
n_it = 200   # Number of iterations

# Compute Z's
Z_tr = np.c_[np.ones(n_tr), Xn_tr] 
Z_tst = np.c_[np.ones(n_tst), Xn_tst]
n_dim = Z_tr.shape[1]

# Convert target arrays to column vectors
Y_tr2 = Y_tr[np.newaxis].T
Y_tst2 = Y_tst[np.newaxis].T

# Running the gradient descent algorithm
w, nll_tr = logregFit(Z_tr, Y_tr2, rho, n_it)

# Classify training and test data
p_tr, D_tr = logregPredict(Z_tr, w)
p_tst, D_tst = logregPredict(Z_tst, w)
    
# Compute error rates
E_tr = D_tr!=Y_tr
E_tst = D_tst!=Y_tst

# Error rates
pe_tr = float(sum(E_tr)) / n_tr
pe_tst = float(sum(E_tst)) / n_tst

# NLL plot.
plt.plot(range(n_it), nll_tr,'b.:', label='Train')
plt.xlabel('Iteration')
plt.ylabel('Negative Log-Likelihood')
plt.legend()

print "The optimal weights are:"
print w
print "The final error rates are:"
print "- Training: " + str(pe_tr)
print "- Test: " + str(pe_tst)
print "The NLL after training is " + str(nll_tr[len(nll_tr)-1])


The optimal weights are:
[[-0.03785205]
 [ 1.1542709 ]
 [ 0.48717643]]
The final error rates are:
- Training: 0.230769230769
- Test: 0.371428571429
The NLL after training is 34.2132457772

2.3. Free parameters

Under certain conditions, the gradient descent method can be shown to converge asymptotically (i.e., as the number of iterations goes to infinity) to the ML estimate of the logistic model. However, in practice, the final estimate of the weights ${\bf w}$ depends on several factors:

  • Number of iterations
  • Initialization
  • Learning step

Exercise: Visualize the variability of gradient descent caused by the random initializations. To do so, fix the number of iterations to 200 and the learning step, execute the gradient descent 100 times, and store the training error rate of each execution. Plot the histogram of the error rate values.

Note that you can do this exercise with a loop over the 100 executions, including the code of the previous code cell inside the loop, with some proper modifications. To plot a histogram of the values in array p with n bins, you can use plt.hist(p, n). One possible outline is sketched below.
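
A possible outline (one of many valid ways to organize it; the names n_runs and pe_tr_all are introduced only for this sketch):

# Sketch: variability of gradient descent over 100 random initializations.
n_runs = 100
pe_tr_all = np.zeros(n_runs)

for r in range(n_runs):
    # Each call to logregFit draws a new random initialization of the weights.
    w_r, _ = logregFit(Z_tr, Y_tr2, rho, 200)
    _, D_tr_r = logregPredict(Z_tr, w_r)
    pe_tr_all[r] = float(sum(D_tr_r != Y_tr)) / n_tr

plt.hist(pe_tr_all, 20)
plt.xlabel('Training error rate')
plt.ylabel('Frequency')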

2.3.1. Learning step

The learning step, $\rho$, is a free parameter of the algorithm, and its choice is critical for convergence: values of $\rho$ that are too large make the algorithm diverge, whereas values that are too small make convergence very slow, so that many more iterations are required.

Exercise 3: Observe the evolution of the negative log-likelihood with the number of iterations for different values of $\rho$. It is easy to check that, for $\rho$ large enough, the gradient descent method does not converge. Can you estimate (by manual observation) an approximate value of $\rho$ marking the boundary between convergence and non-convergence?

Exercise 4: In this exercise we explore the influence of the learning step more systematically. Use the code of the previous exercises to compute, for every value of $\rho$, the average error rate over 100 executions. Plot the average error rate vs. $\rho$.

Note that you should explore the values of $\rho$ in a logarithmic scale. For instance, you can take $\rho = 1, 1/10, 1/100, 1/1000, \ldots$

In practice, the selection of $\rho$ may be a matter of trial and error. There is also some theoretical evidence that the learning step should decrease over time towards zero, and that the sequence $\rho_n$ should satisfy two conditions:

  • C1: $\sum_{n=0}^{\infty} \rho_n^2 < \infty$ (the steps must decrease fast enough)
  • C2: $\sum_{n=0}^{\infty} \rho_n = \infty$ (but not too fast)

For instance, we can take $\rho_n= 1/n$. Another common choice is $\rho_n = \alpha/(1+\beta n)$ where $\alpha$ and $\beta$ are also free parameters that can be selected by trial and error with some heuristic method.
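
As an illustration, here is a minimal sketch (not part of the original notebook) of a variant of logregFit that uses the decreasing schedule $\rho_n = \alpha/(1+\beta n)$, which satisfies both C1 and C2; the name logregFitDecay and the parameters alpha and beta are introduced only for this example:

# Sketch: gradient descent with a decreasing learning step rho_n = alpha / (1 + beta * n).
def logregFitDecay(Z_tr, Y_tr, alpha, beta, n_it):
    n_dim = Z_tr.shape[1]
    nll_tr = np.zeros(n_it)
    w = np.random.randn(n_dim, 1)

    for n in range(n_it):
        rho_n = alpha / (1.0 + beta * n)   # decreasing learning step
        p1_tr = logistic(np.dot(Z_tr, w))
        p0_tr = logistic(-np.dot(Z_tr, w))
        nll_tr[n] = - np.dot(Y_tr.T, np.log(p1_tr)) - np.dot((1 - Y_tr).T, np.log(p0_tr))
        w += rho_n * np.dot(Z_tr.T, Y_tr - p1_tr)

    return w, nll_tr

# Example usage (cf. the call to logregFit above):
# w, nll_tr = logregFitDecay(Z_tr, Y_tr2, 1.0, 0.01, 200)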

2.4. Visualizing the posterior map

We can also visualize the posterior probability map estimated by the logistic regression model for the estimated weights.


In [12]:
# Create a rectangular grid.
x_min, x_max = Xn_tr[:, 0].min(), Xn_tr[:, 0].max() 
y_min, y_max = Xn_tr[:, 1].min(), Xn_tr[:, 1].max()
dx = x_max - x_min
dy = y_max - y_min
h = dy / 400
xx, yy = np.meshgrid(np.arange(x_min - 0.1 * dx, x_max + 0.1 * dx, h),
                     np.arange(y_min - 0.1 * dy, y_max + 0.1 * dy, h))
X_grid = np.array([xx.ravel(), yy.ravel()]).T

# Compute Z's
Z_grid = np.c_[np.ones(X_grid.shape[0]), X_grid] 

# Compute the classifier output for all samples in the grid.
pp, dd = logregPredict(Z_grid, w)

# Put the result into a color plot
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')
plt.axis('equal')
pp = pp.reshape(xx.shape)
plt.contourf(xx, yy, pp, cmap=plt.cm.copper)


Out[12]:
<matplotlib.contour.QuadContourSet instance at 0x107aa85f0>

2.5. Polynomial logistic regression

The error rates of the logistic regression model can potentially be reduced by using polynomial transformations of the inputs.

To compute the polynomial transformation up to a given degree, we can use the PolynomialFeatures transformer in sklearn.preprocessing.
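
For instance (a small illustration, not part of the original notebook), for a two-dimensional sample $(a, b)$ and degree 2, the transformation produces $(1, a, b, a^2, ab, b^2)$:

# Small illustration of PolynomialFeatures: degree-2 expansion of a single 2-D sample.
from sklearn.preprocessing import PolynomialFeatures

poly2 = PolynomialFeatures(degree=2)
print poly2.fit_transform([[2., 3.]])   # [[ 1.  2.  3.  4.  6.  9.]]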


In [13]:
# Parameters of the algorithms
rho = float(1)/50    # Learning step
n_it = 500   # Number of iterations
g = 5 # Degree of polynomial

# Compute Z_tr
poly = PolynomialFeatures(degree=g)
Z_tr = poly.fit_transform(Xn_tr)
# Normalize columns (this is useful to make the algorithm more stable)
Zn, mz, sz = normalize(Z_tr[:,1:])
Z_tr = np.concatenate((np.ones((n_tr,1)), Zn), axis=1)

# Compute Z_tst
Z_tst = poly.transform(Xn_tst)
Zn, mz, sz = normalize(Z_tst[:,1:], mz, sz)
Z_tst = np.concatenate((np.ones((n_tst,1)), Zn), axis=1)

# Convert target arrays to column vectors
Y_tr2 = Y_tr[np.newaxis].T
Y_tst2 = Y_tst[np.newaxis].T

# Running the gradient descent algorithm
w, nll_tr = logregFit(Z_tr, Y_tr2, rho, n_it)

# Classify training and test data
p_tr, D_tr = logregPredict(Z_tr, w)
p_tst, D_tst = logregPredict(Z_tst, w)
    
# Compute error rates
E_tr = D_tr!=Y_tr
E_tst = D_tst!=Y_tst

# Error rates
pe_tr = float(sum(E_tr)) / n_tr
pe_tst = float(sum(E_tst)) / n_tst

# NLL plot.
plt.plot(range(n_it), nll_tr,'b.:', label='Train')
plt.xlabel('Iteration')
plt.ylabel('Negative Log-Likelihood')
plt.legend()

print "The optimal weights are:"
print w
print "The final error rates are:"
print "- Training: " + str(pe_tr)
print "- Test: " + str(pe_tst)
print "The NLL after training is " + str(nll_tr[len(nll_tr)-1])


The optimal weights are:
[[ 0.60540481]
 [ 1.26644588]
 [ 0.0916665 ]
 [-2.20412289]
 [ 0.97864553]
 [-0.83616536]
 [ 0.28858738]
 [-0.07280199]
 [ 1.09552117]
 [ 1.86237099]
 [ 3.88830239]
 [ 1.01656416]
 [-3.90169614]
 [ 2.03193089]
 [ 1.8108233 ]
 [-0.78435048]
 [ 0.02628487]
 [ 0.83488124]
 [ 0.43468305]
 [-0.10673226]
 [-1.2449976 ]]
The final error rates are:
- Training: 0.169230769231
- Test: 0.428571428571
The NLL after training is 28.1133853647

Visualizing the posterior map, we can see that the polynomial transformation produces nonlinear decision boundaries.
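
A minimal sketch of how this visualization could be obtained (not part of the original notebook): it reuses the grid X_grid, xx, yy and the scatter-plot variables from the earlier visualization cell, together with the polynomial transformation, normalization constants and weights fitted in the last cell.

# Sketch: posterior map of the polynomial logistic regression model on the grid (xx, yy).
Z_grid_poly = poly.transform(X_grid)                       # polynomial expansion of the grid
Zn_grid, _, _ = normalize(Z_grid_poly[:, 1:], mz, sz)      # same normalization as the training data
Z_grid_poly = np.concatenate((np.ones((X_grid.shape[0], 1)), Zn_grid), axis=1)

# Compute the classifier output for all samples in the grid.
pp, dd = logregPredict(Z_grid_poly, w)

# Put the result into a color plot.
plt.plot(x0c0, x1c0, 'r.', label=labels[c0])
plt.plot(x0c1, x1c1, 'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')
plt.axis('equal')
pp = pp.reshape(xx.shape)
plt.contourf(xx, yy, pp, cmap=plt.cm.copper)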