This notebook demonstrates the reduction of a multiclass problem into binary ones using Shogun. Here, we will describe the built-in One-vs-Rest, One-vs-One and Error Correcting Output Codes strategies.
In SHOGUN, the strategies for reducing a multiclass problem to binary classification problems are described by an instance of CMulticlassStrategy. A multiclass strategy describes how the multiclass problem is decomposed into binary sub-problems and how the outputs of the binary machines are combined into the final multiclass prediction.
The user can derive from the virtual class CMulticlassStrategy to implement a customized multiclass strategy, but the built-in strategies are usually sufficient for general problems. We will describe the built-in One-vs-Rest, One-vs-One and Error-Correcting Output Codes strategies in this tutorial.
The basic routine to use a multiclass machine with reduction to binary problems in shogun is to create a generic multiclass machine and then assign a particular multiclass strategy and a base binary machine.
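As a minimal sketch of this pattern (on random toy data, using the same calls that appear later in this notebook; the variable names here are illustrative only):
In [ ]:
import numpy as np
import shogun as sg

# Toy data: 3 classes, 2 features; SHOGUN expects multiclass labels 0, 1, ..., K-1
X_toy = np.random.randn(2, 60)
y_toy = np.repeat([0.0, 1.0, 2.0], 20)
feats_toy = sg.create_features(X_toy)
labs_toy = sg.create_labels(y_toy)

# 1) a base binary machine, 2) a multiclass strategy, 3) a generic multiclass machine
bin_machine = sg.create_machine("LibLinear", liblinear_solver_type="L2R_L2LOSS_SVC",
                                use_bias=True, C1=1.0, C2=1.0)
mc_machine = sg.LinearMulticlassMachine(sg.MulticlassOneVsRestStrategy(),
                                        feats_toy, bin_machine, labs_toy)
mc_machine.train()
pred_toy = mc_machine.apply_multiclass(feats_toy)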
The One-vs-Rest strategy is implemented in MulticlassOneVsRestStrategy. As indicated by the name, this strategy reduces a $K$-class problem to $K$ binary sub-problems. For the $k$-th problem, where $k\in\{1,\ldots,K\}$, the samples from class $k$ are colored as $+1$, and the samples from the other classes are colored as $-1$. The multiclass prediction is given as
$$ \hat{y} = \operatorname*{argmax}_{k\in\{1,\ldots,K\}} \; f_k(x) $$
where $f_k(x)$ is the prediction of the $k$-th binary machine.
The One-vs-Rest strategy is easy to implement yet produces excellent performance in many cases. In Rifkin, R. M. and Klautau, A. (2004), In defense of one-vs-all classification, Journal of Machine Learning Research, 5:101–141, it was shown that the One-vs-Rest strategy can be as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines.
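To make the prediction rule concrete, here is a tiny NumPy illustration with made-up scores $f_k(x)$ for $K=4$ classes:
In [ ]:
import numpy as np

# Hypothetical outputs f_k(x) of the K=4 one-vs-rest binary machines for one sample x
f = np.array([-1.3, 0.2, 2.1, -0.5])
print(np.argmax(f))   # the predicted class is the one with the largest output (here: 2)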
Implemented in MulticlassOneVsOneStrategy, the One-vs-One strategy is another simple and intuitive strategy: it produces one binary problem for each pair of classes, so there are $\binom{K}{2}$ binary problems. At prediction time, the outputs of all the binary classifiers are collected to vote for the $K$ classes, and the class with the highest vote becomes the final prediction.
Compared with the One-vs-Rest strategy, the One-vs-One strategy is usually more costly to train and evaluate because more binary machines are used.
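The voting step can likewise be illustrated with plain NumPy (the pairwise outputs below are made up, for $K=4$ classes):
In [ ]:
import numpy as np

K = 4
# Hypothetical signed outputs of the K*(K-1)/2 pairwise machines for one sample:
# for the pair (i, j), a positive output is a vote for class i, a negative one for class j.
pair_outputs = {(0, 1): +0.7, (0, 2): -1.2, (0, 3): +0.3,
                (1, 2): -0.8, (1, 3): +0.5, (2, 3): +1.9}

votes = np.zeros(K)
for (i, j), out in pair_outputs.items():
    votes[i if out > 0 else j] += 1
print(votes, np.argmax(votes))   # class 2 collects the most votes here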
In the following, we demonstrate how to use SHOGUN's One-vs-Rest and One-vs-One multiclass learning strategies on the USPS dataset. For demonstration, we randomly select 200 samples from each class for training and 200 samples from each class for testing.
LibLinear is used as the base binary classifier in a LinearMulticlassMachine, with One-vs-Rest and One-vs-One strategies. The running time and performance (on my machine) are reported below:
First we load the data and initialize random splitting:
In [ ]:
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import numpy as np
from numpy import random
from scipy.io import loadmat
import shogun as sg
import time
mat = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))
Xall = mat['data']
#normalize examples to have norm one
Xall = Xall / np.sqrt(np.sum(Xall**2, axis=0))
Yall = mat['label'].squeeze()
# map from 1..10 to 0..9, since shogun
# requires multiclass labels to be
# 0, 1, ..., K-1
Yall = Yall - 1
N_train_per_class = 200
N_test_per_class = 200
N_class = 10
# to make the results reproducible
random.seed(0)
# index for subsampling
index = np.zeros((N_train_per_class+N_test_per_class, N_class), 'i')
for k in range(N_class):
    Ik = (Yall == k).nonzero()[0]  # indices of samples of class k
    I_subsample = random.permutation(len(Ik))[:N_train_per_class+N_test_per_class]
    index[:, k] = Ik[I_subsample]
idx_train = index[:N_train_per_class, :].reshape(N_train_per_class*N_class)
idx_test = index[N_train_per_class:, :].reshape(N_test_per_class*N_class)
random.shuffle(idx_train)
random.shuffle(idx_test)
Convert features into SHOGUN format:
In [ ]:
feats_train = sg.create_features(Xall[:, idx_train])
feats_test = sg.create_features(Xall[:, idx_test])
lab_train = sg.create_labels(Yall[idx_train].astype('d'))
lab_test = sg.create_labels(Yall[idx_test].astype('d'))
Define a helper function to train and evaluate a multiclass machine given a strategy:
In [ ]:
def evaluate(strategy, C):
    bin_machine = sg.create_machine("LibLinear", liblinear_solver_type="L2R_L2LOSS_SVC",
                                    use_bias=True, C1=C, C2=C)
    mc_machine = sg.LinearMulticlassMachine(strategy, feats_train, bin_machine, lab_train)
    t_begin = time.process_time()
    mc_machine.train()
    t_train = time.process_time() - t_begin
    t_begin = time.process_time()
    pred_test = mc_machine.apply(feats_test)
    t_test = time.process_time() - t_begin
    evaluator = sg.create_evaluation("MulticlassAccuracy")
    acc = evaluator.evaluate(pred_test, lab_test)
    print("training time: %.4f" % t_train)
    print("testing time: %.4f" % t_test)
    print("accuracy: %.4f" % acc)
Test on One-vs-Rest and One-vs-One strategies.
In [ ]:
print("\nOne-vs-Rest")
print("="*60)
evaluate(sg.MulticlassOneVsRestStrategy(), 5.0)
print("\nOne-vs-One")
print("="*60)
evaluate(sg.MulticlassOneVsOneStrategy(), 2.0)
LibLinear also has a true multiclass SVM implementation, so it is worthwhile to compare training time and accuracy with the above reduction schemes:
In [ ]:
mcsvm = sg.create_machine("MulticlassLibLinear", C=5.0,
labels=lab_train, use_bias=True)
t_begin = time.process_time()
mcsvm.train(feats_train)
t_train = time.process_time() - t_begin
t_begin = time.process_time()
pred_test = mcsvm.apply(feats_test)
t_test = time.process_time() - t_begin
evaluator = sg.create_evaluation("MulticlassAccuracy")
acc = evaluator.evaluate(pred_test, lab_test)
print("training time: %.4f" % t_train)
print("testing time: %.4f" % t_test)
print("accuracy: %.4f" % acc)
As you can see, the performance of all three approaches is very similar, though the multiclass SVM is a bit faster to train here. Usually, training a true multiclass SVM is much slower than the one-vs-rest approach. It should be noted that the classification performance of one-vs-one is known to be slightly superior to one-vs-rest, since the binary machines do not have to be scaled against each other as in the one-vs-rest approach. However, with a larger number of classes, one-vs-one quickly becomes prohibitively expensive, so one-vs-rest (or one of the other schemes presented below) is the only suitable approach.
Error-Correcting Output Codes (ECOC) is a generalization of the One-vs-Rest and One-vs-One strategies. For example, we can represent the One-vs-Rest strategy with the following $K\times K$ coding matrix, or a codebook:
$$ \begin{bmatrix} +1 & -1 & -1 & \ldots & -1 & -1 \\\\ -1 & +1 & -1 & \ldots & -1 & -1\\\\ -1 & -1 & +1 & \ldots & -1 & -1\\\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\\\ -1 & -1 & -1 & \ldots & +1 & -1 \\\\ -1 & -1 & -1 & \ldots & -1 & +1 \end{bmatrix} $$
Denote the codebook by $B$; there is one column of the codebook associated with each of the $K$ classes. For example, the code for class $1$ is $[+1,-1,-1,\ldots,-1]$. Each row of the codebook corresponds to a binary coloring of all the $K$ classes. For example, in the first row, class $1$ is colored as $+1$, while the rest of the classes are all colored as $-1$. Associated with each row there is a binary classifier trained according to that coloring. For example, the binary classifier associated with the first row is trained by treating all the examples of class $1$ as positive examples, and all the examples of the other classes as negative examples.
In this special case, there are $K$ rows in the codebook. The number of rows in the codebook is usually called the code length. As we can see, this codebook exactly describes how the One-vs-Rest strategy trains the binary sub-machines.
In [ ]:
OvR=-np.ones((10,10))
fill_diagonal(OvR, +1)
_=gray()
_=imshow(OvR, interpolation='nearest')
_=gca().set_xticks([])
_=gca().set_yticks([])
A further generalization is to allow $0$-values in the codebook. A $0$ for class $k$ in a row means we ignore (the examples of) class $k$ when training the binary classifier associated with that row. With this generalization, we can also easily describe the One-vs-One strategy with a $\binom{K}{2}\times K$ codebook:
$$ \begin{bmatrix} +1 & -1 & 0 & \ldots & 0 & 0 \\\\ +1 & 0 & -1 & \ldots & 0 & 0 \\\\ \vdots & \vdots & \vdots & \ddots & \vdots & 0 \\\\ +1 & 0 & 0 & \ldots & -1 & 0 \\\\ 0 & +1 & -1 & \ldots & 0 & 0 \\\\ \vdots & \vdots & \vdots & & \vdots & \vdots \\\\ 0 & 0 & 0 & \ldots & +1 & -1 \end{bmatrix} $$
Here each of the $\binom{K}{2}$ rows describes a binary classifier trained on a pair of classes. The resultant binary classifiers are identical to those described by the One-vs-One strategy.
Since $0$ is allowed in the codebook to ignore some classes, this kind of codebook is usually called a sparse codebook, while codebooks containing only $+1$ and $-1$ are usually called dense codebooks.
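As a small illustration (not how SHOGUN generates codebooks internally), the One-vs-One codebook above can be built with a few lines of NumPy:
In [ ]:
import numpy as np
from itertools import combinations

def one_vs_one_codebook(K):
    """Build the binom(K,2) x K sparse codebook that describes One-vs-One."""
    pairs = list(combinations(range(K), 2))
    B = np.zeros((len(pairs), K), dtype=int)
    for row, (i, j) in enumerate(pairs):
        B[row, i] = +1   # class i is the positive class of this binary problem
        B[row, j] = -1   # class j is the negative class; all other classes are ignored (0)
    return B

print(one_vs_one_codebook(4))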
In the general case, we can specify any code length and fill the codebook arbitrarily. However, a few rules should be followed: every row must contain both $+1$ and $-1$ (otherwise no binary problem can be formed for that row); rows that are duplicates, or sign-flipped copies, of other rows should be avoided, since they produce redundant binary machines; and the columns must all be distinct, so that every class has a unique code.
Though you can certainly generate your own codebook, it is usually easier to use SHOGUN's built-in procedures to generate codebooks automatically. There are various codebook generators (called encoders) in SHOGUN. However, before describing those encoders in detail, note that a codebook only describes how the sub-machines are trained; we still need a way to specify how the binary classification results of the sub-machines are combined to obtain a multiclass classification result.
Review the codebook again: there is one column corresponding to each class. We call this column the (binary) code for that class. For a new sample $x$, applying the binary classifiers associated with each row successively yields a prediction vector of the same length as the code. Deciding the multiclass label from the prediction vector (called decoding) can be done by minimizing the distance between the class codes and the prediction vector; different decoders define different choices of distance function. For this reason, it is usually good to make the mutual distance between the codes of different classes large. In that case, even if several binary classifiers make wrong predictions, the distance of the resultant prediction vector to the code of the true class is still likely to be smaller than the distance to the other classes, so the correct result can still be obtained even when some of the binary classifiers make mistakes. This is the reason for the name Error-Correcting Output Codes.
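As a rough sketch of this decoding idea (with a random dense codebook and made-up binary predictions, not SHOGUN's own decoders), Hamming-style decoding looks like this:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
K, code_length = 4, 7
codebook = rng.choice([-1, +1], size=(code_length, K))   # one column (code) per class
pred = np.sign(rng.standard_normal(code_length))         # hypothetical binary predictions for one sample

# Hamming-style decoding: pick the class whose code disagrees with the predictions the least
distances = (codebook != pred[:, None]).sum(axis=0)
print(distances, distances.argmin())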
In SHOGUN, encoding schemes are described by subclasses of CECOCEncoder, while decoding schemes are described by subclasses of CECOCDecoder. Theoretically, any combination of encoder and decoder can be used. Here we will introduce several common encoders and decoders in SHOGUN.
Using an ECOC strategy in SHOGUN is similar to ordinary one-vs-rest or one-vs-one: you choose an encoder and a decoder, and then construct an ECOCStrategy, as demonstrated below:
In [ ]:
print("\nRandom Dense Encoder + Margin Loss based Decoder")
print("="*60)
evaluate(sg.ECOCStrategy(sg.ECOCRandomDenseEncoder(), sg.ECOCLLBDecoder()), 2.0)
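Other encoder/decoder pairs can be plugged in the same way. For instance, assuming the ECOCOVREncoder and ECOCHDDecoder classes are available in your SHOGUN build (they are not used elsewhere in this notebook, so treat the names as an assumption), the following should behave much like ordinary One-vs-Rest with Hamming-distance decoding:
In [ ]:
# Assumes sg.ECOCOVREncoder and sg.ECOCHDDecoder exist in your SHOGUN build.
print("\nOvR Encoder + Hamming Distance Decoder")
print("="*60)
evaluate(sg.ECOCStrategy(sg.ECOCOVREncoder(), sg.ECOCHDDecoder()), 2.0)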
Expanding on the idea of creating a generic multiclass machine and then assigning a particular multiclass strategy and a base binary machine, one can also use the KernelMulticlassMachine with a kernel of choice.
Here we will use a GaussianKernel with LibSVM as the classifier. All we have to do is define a new helper evaluate function, with the features defined as in the examples above.
In [ ]:
def evaluate_multiclass_kernel(strategy):
    width=2.1
    epsilon=1e-5
    kernel=sg.create_kernel("GaussianKernel", log_width=np.log(width))
    kernel.init(feats_train, feats_train)
    classifier = sg.create_machine("LibSVM", epsilon=epsilon)
    mc_machine = sg.KernelMulticlassMachine(strategy, kernel, classifier, lab_train)
    t_begin = time.process_time()
    mc_machine.train()
    t_train = time.process_time() - t_begin
    t_begin = time.process_time()
    pred_test = mc_machine.apply_multiclass(feats_test)
    t_test = time.process_time() - t_begin
    evaluator = sg.create_evaluation("MulticlassAccuracy")
    acc = evaluator.evaluate(pred_test, lab_test)
    print("training time: %.4f" % t_train)
    print("testing time: %.4f" % t_test)
    print("accuracy: %.4f" % acc)
print("\nOne-vs-Rest")
print("="*60)
evaluate_multiclass_kernel(sg.MulticlassOneVsRestStrategy())
So we have seen that we can classify multiclass samples using a base binary machine. If we dwell on this a bit more, we can easily spot the intuition behind this.
The MulticlassOneVsRestStrategy classifies one class against the rest of the classes. This is done for each and every class by training a separate classifier for it. So we will have a total of $k$ classifiers, where $k$ is the number of classes.
Just to see this in action, let's create some data using the Gaussian mixture model class (GMM) from which we sample the data points. Four different classes are created and plotted.
In [ ]:
num=1000
dist=1.0
gmm=sg.GMM(4)
gmm.set_nth_mean(array([-dist*4,-dist]),0)
gmm.set_nth_mean(array([-dist*4,dist*4]),1)
gmm.set_nth_mean(array([dist*4,dist*4]),2)
gmm.set_nth_mean(array([dist*4,-dist]),3)
gmm.set_nth_cov(array([[1.0,0.0],[0.0,1.0]]),0)
gmm.set_nth_cov(array([[1.0,0.0],[0.0,1.0]]),1)
gmm.set_nth_cov(array([[1.0,0.0],[0.0,1.0]]),2)
gmm.set_nth_cov(array([[1.0,0.0],[0.0,1.0]]),3)
gmm.put('coefficients', array([1.0,0.0,0.0,0.0]))
x0=np.array([gmm.sample() for i in range(num)]).T
x0t=np.array([gmm.sample() for i in range(num)]).T
gmm.put('coefficients', array([0.0,1.0,0.0,0.0]))
x1=np.array([gmm.sample() for i in range(num)]).T
x1t=np.array([gmm.sample() for i in range(num)]).T
gmm.put('coefficients', array([0.0,0.0,1.0,0.0]))
x2=np.array([gmm.sample() for i in range(num)]).T
x2t=np.array([gmm.sample() for i in range(num)]).T
gmm.put('coefficients', array([0.0,0.0,0.0,1.0]))
x3=np.array([gmm.sample() for i in range(num)]).T
x3t=np.array([gmm.sample() for i in range(num)]).T
traindata=np.concatenate((x0,x1,x2,x3), axis=1)
testdata=np.concatenate((x0t,x1t,x2t,x3t), axis=1)
l0 = np.array([0.0 for i in range(num)])
l1 = np.array([1.0 for i in range(num)])
l2 = np.array([2.0 for i in range(num)])
l3 = np.array([3.0 for i in range(num)])
trainlab=np.concatenate((l0,l1,l2,l3))
testlab=np.concatenate((l0,l1,l2,l3))
In [ ]:
_=jet()
_=scatter(traindata[0,:], traindata[1,:], c=trainlab, s=100)
Now that we have the data ready, let's convert it to SHOGUN format features.
In [ ]:
feats_tr=sg.create_features(traindata)
labels=sg.create_labels(trainlab)
The KernelMulticlassMachine is used with LibSVM as the classifier, just as in the above example.
Now we have four different classes, so as explained above we will have four classifiers, which in SHOGUN terms are submachines.
We can see the outputs of two of the four individual submachines (specified by the index) and of the main machine. The plots clearly show how each submachine classifies its class as if it were a binary classification problem, and this provides the basis for the whole multiclass classification.
In [ ]:
width=2.1
epsilon=1e-5
kernel=sg.create_kernel("GaussianKernel", log_width=np.log(width))
kernel.init(feats_tr, feats_tr)
classifier=sg.create_machine("LibSVM", epsilon=epsilon)
mc_machine=sg.KernelMulticlassMachine(sg.MulticlassOneVsRestStrategy(), kernel, classifier, labels)
mc_machine.train()
size=100
x1=linspace(-10, 10, size)
x2=linspace(-10, 10, size)
x, y=meshgrid(x1, x2)
grid=sg.create_features(np.array((ravel(x), ravel(y)))) #test features
out=mc_machine.apply_multiclass(grid) #main output
z=out.get("labels").reshape((size, size))
sub_out0=mc_machine.get_submachine_outputs(0) #first submachine
sub_out1=mc_machine.get_submachine_outputs(1) #second submachine
z0=sub_out0.get_labels().reshape((size, size))
z1=sub_out1.get_labels().reshape((size, size))
figure(figsize=(20,5))
subplot(131, title="Submachine 1")
c0=pcolor(x, y, z0)
_=contour(x, y, z0, linewidths=1, colors='black')
_=colorbar(c0)
subplot(132, title="Submachine 2")
c1=pcolor(x, y, z1)
_=contour(x, y, z1, linewidths=1, colors='black')
_=colorbar(c1)
subplot(133, title="Multiclass output")
c2=pcolor(x, y, z)
_=contour(x, y, z, linewidths=1, colors='black')
_=colorbar(c2)
The MulticlassOneVsOneStrategy is a bit different in that it uses more machines. Since it trains a classifier for each pair of classes, we will have a total of $\frac{k(k-1)}{2}$ submachines for $k$ classes. Binary classification then takes place on each pair.
Let's visualize this in a plot.
In [ ]:
C=2.0
bin_machine = sg.create_machine("LibLinear", liblinear_solver_type="L2R_L2LOSS_SVC",
use_bias=True, C1=C, C2=C)
mc_machine1 = sg.LinearMulticlassMachine(sg.MulticlassOneVsOneStrategy(), feats_tr, bin_machine, labels)
mc_machine1.train()
out1=mc_machine1.apply_multiclass(grid) #main output
z1=out1.get_labels().reshape((size, size))
sub_out10=mc_machine1.get_submachine_outputs(0) #first submachine
sub_out11=mc_machine1.get_submachine_outputs(1) #second submachine
z10=sub_out10.get_labels().reshape((size, size))
z11=sub_out11.get_labels().reshape((size, size))
no_color=array([5.0 for i in range(num)])
figure(figsize=(20,5))
subplot(131, title="Submachine 1") #plot submachine and traindata
c10=pcolor(x, y, z10)
_=contour(x, y, z10, linewidths=1, colors='black')
lab1=concatenate((l0,l1,no_color,no_color))
_=scatter(traindata[0,:], traindata[1,:], c=lab1, cmap='gray', s=100)
_=colorbar(c10)
subplot(132, title="Submachine 2")
c11=pcolor(x, y, z11)
_=contour(x, y, z11, linewidths=1, colors='black')
lab2=concatenate((l0, no_color, l2, no_color))
_=scatter(traindata[0,:], traindata[1,:], c=lab2, cmap="gray", s=100)
_=colorbar(c11)
subplot(133, title="Multiclass output")
c12=pcolor(x, y, z1)
_=contour(x, y, z1, linewidths=1, colors='black')
_=colorbar(c12)
The first two plots help us visualize how the submachines do binary classification for each pair. The class with the maximum number of votes is chosen for each test sample, leading to the refined multiclass output shown in the last plot.