In [2]:
# To visualize plots in the notebook
%matplotlib inline
# Imported libraries (these are used later in the notebook)
import csv
import random
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn import svm
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.
The advantages of support vector machines are:
- They are effective in high-dimensional spaces, even when the number of dimensions is larger than the number of samples.
- They are memory efficient, since the decision function only uses a subset of the training points (the support vectors).
- They are versatile: different kernel functions can be specified for the decision function.
The disadvantages of support vector machines include:
- If the number of features is much greater than the number of samples, the method is prone to over-fitting, so the choice of kernel and regularization is critical.
- SVMs do not directly provide probability estimates.
SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
SVC and NuSVC are similar methods, but they accept slightly different sets of parameters and have different mathematical formulations (see the section Mathematical formulation). On the other hand, LinearSVC is another implementation of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept the keyword kernel, as the kernel is assumed to be linear. It also lacks some of the attributes of SVC and NuSVC, like support_. Like other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers) of size [n_samples]:
In [3]:
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
Out[3]:
After being fitted, the model can then be used to predict new values:
In [4]:
clf.predict([[2., 2.]])
Out[4]:
The SVM decision function depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the attributes support_vectors_, support_ and n_support_:
In [7]:
# get support vectors
print clf.support_vectors_
# get indices of support vectors
print clf.support_
# get number of support vectors for each class
print clf.n_support_
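As a quick illustration of the LinearSVC interface mentioned above, here is a minimal sketch on the same toy data (the exact numerical output will depend on the scikit-learn version):
# Minimal sketch: LinearSVC on the same toy data as above
lin_clf = svm.LinearSVC()
lin_clf.fit(X, y)
print lin_clf.predict([[2., 2.]])
# LinearSVC has no support_ attribute, but it exposes the linear decision function
print lin_clf.coef_
print lin_clf.intercept_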
As an illustration, consider the Iris dataset, taken from the UCI Machine Learning repository. This dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant (setosa, versicolor or virginica). Each instance contains 4 measurements of given flowers: sepal length, sepal width, petal length and petal width, all in centimeters.
We will try to fit a logistic regression model to discriminate between two classes using only two attributes.
First, we load the dataset and split it into training and test subsets.
In [5]:
# Adapted from a notebook by Jason Brownlee
def loadDataset(filename, split):
    xTrain = []
    cTrain = []
    xTest = []
    cTest = []
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for i in range(len(dataset)-1):
            for y in range(4):
                dataset[i][y] = float(dataset[i][y])
            item = dataset[i]
            # Randomly assign each instance to the training or the test set
            if random.random() < split:
                xTrain.append(item[0:4])
                cTrain.append(item[4])
            else:
                xTest.append(item[0:4])
                cTest.append(item[4])
    return xTrain, cTrain, xTest, cTest
xTrain_all, cTrain_all, xTest_all, cTest_all = loadDataset('iris.data', 0.66)
nTrain_all = len(xTrain_all)
nTest_all = len(xTest_all)
print 'Train: ' + str(nTrain_all)
print 'Test: ' + str(nTest_all)
Now, we select two classes and two attributes.
In [6]:
# Select attributes
i = 0   # Try 0,1,2,3
j = 1   # Try 0,1,2,3 with j!=i
# Select two classes
c0 = 'Iris-versicolor'
c1 = 'Iris-virginica'
# Select two coordinates
ind = [i, j]
# Take training set
X_tr = np.array([[xTrain_all[n][k] for k in ind] for n in range(nTrain_all)
                 if cTrain_all[n]==c0 or cTrain_all[n]==c1])
C_tr = [cTrain_all[n] for n in range(nTrain_all)
        if cTrain_all[n]==c0 or cTrain_all[n]==c1]
Y_tr = np.array([int(c==c1) for c in C_tr])
n_tr = len(X_tr)
# Take test set
X_tst = np.array([[xTest_all[n][k] for k in ind] for n in range(nTest_all)
                  if cTest_all[n]==c0 or cTest_all[n]==c1])
C_tst = [cTest_all[n] for n in range(nTest_all)
         if cTest_all[n]==c0 or cTest_all[n]==c1]
Y_tst = np.array([int(c==c1) for c in C_tst])
n_tst = len(X_tst)
Normalization of the data is a common pre-processing step in many machine learning algorithms. Its goal is to obtain a dataset where all input coordinates have a similar scale. Learning algorithms are usually more stable and converge more easily when the data are normalized.
We will define a normalization function that returns a training data matrix with zero sample mean and unit sample variance.
In [7]:
def normalize(X, mx=None, sx=None):
    # Compute means and standard deviations
    if mx is None:
        mx = np.mean(X, axis=0)
    if sx is None:
        sx = np.std(X, axis=0)
    # Normalize
    X0 = (X-mx)/sx
    return X0, mx, sx
Now, we can normalize the training and test data. Observe in the code that the same transformation must be applied to both training and test data. This is why the test data are normalized using the means and standard deviations computed from the training set.
In [8]:
# Normalize data
Xn_tr, mx, sx = normalize(X_tr)
Xn_tst, mx, sx = normalize(X_tst, mx, sx)
The following code generates a scatter plot of the normalized training data.
In [9]:
# Separate components of x into different arrays (just for the plots)
x0c0 = [Xn_tr[n][0] for n in range(n_tr) if Y_tr[n]==0]
x1c0 = [Xn_tr[n][1] for n in range(n_tr) if Y_tr[n]==0]
x0c1 = [Xn_tr[n][0] for n in range(n_tr) if Y_tr[n]==1]
x1c1 = [Xn_tr[n][1] for n in range(n_tr) if Y_tr[n]==1]
# Scatterplot.
labels = {'Iris-setosa': 'Setosa',
          'Iris-versicolor': 'Versicolor',
          'Iris-virginica': 'Virginica'}
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')
plt.axis('equal')
Out[9]:
In order to apply the gradient descent rule, we need to define two methods:
- A fit method, which receives the training data and returns the model weights and the values of the negative log-likelihood at every iteration.
- A predict method, which receives the model weights and a set of inputs, and returns the posterior class probabilities for those inputs, as well as the corresponding class predictions.
In [10]:
def logistic(x):
    # Logistic (sigmoid) function, used by logregFit and logregPredict
    return 1.0 / (1.0 + np.exp(-x))

def logregFit(Z_tr, Y_tr, rho, n_it):
    # Data dimension
    n_dim = Z_tr.shape[1]
    # Initialize variables
    nll_tr = np.zeros(n_it)
    w = np.random.randn(n_dim, 1)
    # Running the gradient descent algorithm
    for n in range(n_it):
        # Compute posterior probabilities for weight w
        p1_tr = logistic(np.dot(Z_tr, w))
        p0_tr = logistic(-np.dot(Z_tr, w))
        # Compute negative log-likelihood
        nll_tr[n] = - np.dot(Y_tr.T, np.log(p1_tr)) - np.dot((1-Y_tr).T, np.log(p0_tr))
        # Update weights
        w += rho*np.dot(Z_tr.T, Y_tr - p1_tr)
    return w, nll_tr

def logregPredict(Z, w):
    # Compute posterior probability of class 1 for weights w.
    p = logistic(np.dot(Z, w))
    # Class predictions (threshold the posterior probability at 0.5)
    D = [int(round(pn)) for pn in p]
    return p, D
We can test the behavior of the gradient descent method by fitting a logistic regression model with ${\bf z}({\bf x}) = (1, {\bf x}^\intercal)^\intercal$.
In [11]:
# Parameters of the algorithms
rho = float(1)/50 # Learning step
n_it = 200 # Number of iterations
# Compute Z's
Z_tr = np.c_[np.ones(n_tr), Xn_tr]
Z_tst = np.c_[np.ones(n_tst), Xn_tst]
n_dim = Z_tr.shape[1]
# Convert target arrays to column vectors
Y_tr2 = Y_tr[np.newaxis].T
Y_tst2 = Y_tst[np.newaxis].T
# Running the gradient descent algorithm
w, nll_tr = logregFit(Z_tr, Y_tr2, rho, n_it)
# Classify training and test data
p_tr, D_tr = logregPredict(Z_tr, w)
p_tst, D_tst = logregPredict(Z_tst, w)
# Compute error rates
E_tr = D_tr!=Y_tr
E_tst = D_tst!=Y_tst
# Error rates
pe_tr = float(sum(E_tr)) / n_tr
pe_tst = float(sum(E_tst)) / n_tst
# NLL plot.
plt.plot(range(n_it), nll_tr,'b.:', label='Train')
plt.xlabel('Iteration')
plt.ylabel('Negative Log-Likelihood')
plt.legend()
print "The optimal weights are:"
print w
print "The final error rates are:"
print "- Training: " + str(pe_tr)
print "- Test: " + str(pe_tst)
print "The NLL after training is " + str(nll_tr[len(nll_tr)-1])
Under certain conditions, the gradient descent method can be shown to converge asymptotically (i.e., as the number of iterations goes to infinity) to the ML estimate of the logistic model. However, in practice, the final estimate of the weights ${\bf w}$ depends on several factors, such as the initialization of the weights, the learning step and the number of iterations.
Exercise: Visualize the variability of gradient descent caused by initializations. To do so, fix the number of iterations to 200 and the learning step, and execute the gradient descent 100 times, storing the training error rate of each execution. Plot the histogram of the error rate values.
Note that you can do this exercise with a loop over the 100 executions, including the code in the previous cell inside the loop, with some appropriate modifications. To plot a histogram of the values in array p with n bins, you can use plt.hist(p, n).
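A possible sketch of this experiment, reusing logregFit, logregPredict, Z_tr, Y_tr2, Y_tr, rho and n_tr defined above (the names n_runs and pe_runs are just for this illustration):
# Sketch: variability of gradient descent due to the random initialization
n_runs = 100
pe_runs = []
for run in range(n_runs):
    # Each call to logregFit draws a new random initial value of w
    w_run, nll_run = logregFit(Z_tr, Y_tr2, rho, 200)
    p_run, D_run = logregPredict(Z_tr, w_run)
    pe_runs.append(float(sum(np.array(D_run) != Y_tr)) / n_tr)
# Histogram of the training error rates
plt.hist(pe_runs, 20)
plt.xlabel('Training error rate')
plt.ylabel('Number of executions')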
The learning step, $\rho$, is a free parameter of the algorithm, and its choice is critical for convergence. Too large values of $\rho$ make the algorithm diverge; for too small values, convergence is too slow and many more iterations are required to reach a good solution.
Exercise 3: Observe the evolution of the negative log-likelihood with the number of iterations, for different values of $\rho$. It is easy to check that, for $\rho$ large enough, the gradient descent method does not converge. Can you estimate (through manual observation) an approximate value of $\rho$ marking the boundary between convergence and non-convergence?
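One possible sketch of this exercise, reusing logregFit, Z_tr and Y_tr2 from above (the particular values of $\rho$ are only an example):
# Sketch: evolution of the NLL for several values of the learning step
for rho_k in [1.0, 0.1, 0.02, 0.002]:
    w_k, nll_k = logregFit(Z_tr, Y_tr2, rho_k, 200)
    plt.plot(range(200), nll_k, label='rho = ' + str(rho_k))
plt.xlabel('Iteration')
plt.ylabel('Negative Log-Likelihood')
plt.legend()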
Exercise 4: In this exercise we explore the influence of the learning step more systematically. Use the code from the previous exercises to compute, for every value of $\rho$, the average error rate over 100 executions. Plot the average error rate vs. $\rho$.
Note that you should explore the values of $\rho$ on a logarithmic scale. For instance, you can take $\rho = 1, 1/10, 1/100, 1/1000, \ldots$
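A sketch of one way to organize this experiment, again reusing the functions and data above (the number of executions per value of $\rho$ can be reduced to speed it up):
# Sketch: average training error rate as a function of the learning step
rho_values = [1.0 / 10**k for k in range(5)]   # 1, 1/10, 1/100, ...
avg_pe = []
for rho_k in rho_values:
    pe_runs = []
    for run in range(100):
        w_run, nll_run = logregFit(Z_tr, Y_tr2, rho_k, 200)
        p_run, D_run = logregPredict(Z_tr, w_run)
        pe_runs.append(float(sum(np.array(D_run) != Y_tr)) / n_tr)
    avg_pe.append(np.mean(pe_runs))
plt.semilogx(rho_values, avg_pe, 'bo-')
plt.xlabel(r'$\rho$')
plt.ylabel('Average training error rate')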
In practice, the selection of $\rho$ may be a matter of trial and error. There is also some theoretical evidence that the learning step should decrease over time towards zero, and that the sequence $\rho_n$ should satisfy two conditions: $\sum_{n=1}^{\infty} \rho_n = \infty$ (so that the weights can move arbitrarily far from their initial value) and $\sum_{n=1}^{\infty} \rho_n^2 < \infty$ (so that the updates eventually vanish).
For instance, we can take $\rho_n = 1/n$. Another common choice is $\rho_n = \alpha/(1+\beta n)$, where $\alpha$ and $\beta$ are also free parameters that can be selected by trial and error with some heuristic method.
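As an illustration of the second choice, here is a minimal sketch of a variant of logregFit with a decreasing learning step $\rho_n = \alpha/(1+\beta n)$ (the name logregFitDecay is just for this illustration):
# Sketch: gradient descent with a decreasing learning step rho_n = alpha/(1 + beta*n)
def logregFitDecay(Z_tr, Y_tr, alpha, beta, n_it):
    n_dim = Z_tr.shape[1]
    nll_tr = np.zeros(n_it)
    w = np.random.randn(n_dim, 1)
    for n in range(n_it):
        # Learning step decreasing with the iteration index
        rho_n = alpha / (1.0 + beta*n)
        p1_tr = logistic(np.dot(Z_tr, w))
        p0_tr = logistic(-np.dot(Z_tr, w))
        nll_tr[n] = - np.dot(Y_tr.T, np.log(p1_tr)) - np.dot((1-Y_tr).T, np.log(p0_tr))
        w += rho_n * np.dot(Z_tr.T, Y_tr - p1_tr)
    return w, nll_tr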
In [12]:
# Create a rectangular grid.
x_min, x_max = Xn_tr[:, 0].min(), Xn_tr[:, 0].max()
y_min, y_max = Xn_tr[:, 1].min(), Xn_tr[:, 1].max()
dx = x_max - x_min
dy = y_max - y_min
h = dy / 400
xx, yy = np.meshgrid(np.arange(x_min - 0.1 * dx, x_max + 0.1 * dx, h),
                     np.arange(y_min - 0.1 * dy, y_max + 0.1 * dy, h))
X_grid = np.array([xx.ravel(), yy.ravel()]).T
# Compute Z's
Z_grid = np.c_[np.ones(X_grid.shape[0]), X_grid]
# Compute the classifier output for all samples in the grid.
pp, dd = logregPredict(Z_grid, w)
# Put the result into a color plot
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')
plt.axis('equal')
pp = pp.reshape(xx.shape)
plt.contourf(xx, yy, pp, cmap=plt.cm.copper)
Out[12]:
In [13]:
# Parameters of the algorithms
rho = float(1)/50 # Learning step
n_it = 500 # Number of iterations
g = 5 # Degree of polynomial
# Compute Z_tr
poly = PolynomialFeatures(degree=g)
Z_tr = poly.fit_transform(Xn_tr)
# Normalize columns (this is useful to make algorithms more stable).
Zn, mz, sz = normalize(Z_tr[:,1:])
Z_tr = np.concatenate((np.ones((n_tr,1)), Zn), axis=1)
# Compute Z_tst
Z_tst = poly.fit_transform(Xn_tst)
Zn, mz, sz = normalize(Z_tst[:,1:], mz, sz)
Z_tst = np.concatenate((np.ones((n_tst,1)), Zn), axis=1)
# Convert target arrays to column vectors
Y_tr2 = Y_tr[np.newaxis].T
Y_tst2 = Y_tst[np.newaxis].T
# Running the gradient descent algorithm
w, nll_tr = logregFit(Z_tr, Y_tr2, rho, n_it)
# Classify training and test data
p_tr, D_tr = logregPredict(Z_tr, w)
p_tst, D_tst = logregPredict(Z_tst, w)
# Compute error rates
E_tr = D_tr!=Y_tr
E_tst = D_tst!=Y_tst
# Error rates
pe_tr = float(sum(E_tr)) / n_tr
pe_tst = float(sum(E_tst)) / n_tst
# NLL plot.
plt.plot(range(n_it), nll_tr,'b.:', label='Train')
plt.xlabel('Iteration')
plt.ylabel('Negative Log-Likelihood')
plt.legend()
print "The optimal weights are:"
print w
print "The final error rates are:"
print "- Training: " + str(pe_tr)
print "- Test: " + str(pe_tst)
print "The NLL after training is " + str(nll_tr[len(nll_tr)-1])
Visualizing the posterior map, we can see that the polynomial transformation produces nonlinear decision boundaries.
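That posterior map can be obtained with a sketch like the following, which reuses the grid (xx, yy, X_grid), the plotting arrays and the poly, mz, sz and w objects computed above:
# Sketch: posterior map of the polynomial logistic regression model over the same grid
Z_grid_poly = poly.fit_transform(X_grid)
Zn_grid, mz, sz = normalize(Z_grid_poly[:, 1:], mz, sz)
Z_grid_poly = np.concatenate((np.ones((X_grid.shape[0], 1)), Zn_grid), axis=1)
pp_poly, dd_poly = logregPredict(Z_grid_poly, w)
plt.plot(x0c0, x1c0, 'r.', label=labels[c0])
plt.plot(x0c1, x1c1, 'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')
plt.axis('equal')
pp_poly = pp_poly.reshape(xx.shape)
plt.contourf(xx, yy, pp_poly, cmap=plt.cm.copper)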