In [1]:
import numpy as np
import active
import experiment
import logistic_regression as logr
from sklearn import datasets # The Iris dataset is imported from here.
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 1
%aimport active
%aimport experiment
%aimport logistic_regression
np.set_printoptions(precision=4)
In [2]:
plt.rcParams['axes.labelsize'] = 15
plt.rcParams['axes.titlesize'] = 15
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15
plt.rcParams['legend.fontsize'] = 15
plt.rcParams['figure.titlesize'] = 18
In this experiment we work with the classic Iris flower data set. It consists of 3 classes of 50 instances each, where each class corresponds to a different species of the Iris flower. Each instance has 4 features.
This data set is useful because one of the classes is known to be linearly separable from the other two. (The data set used in the other experiment, the Pima Indians diabetes data set, is not linearly separable.)
For simplicity, we work with only the first two features, sepal length (cm) and sepal width (cm), and label the Iris Setosa class +1 and the other two classes, Iris Versicolour and Iris Virginica, -1.
We randomly divide the data set into two halves, one for training and one for testing.
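As an aside, the separability claim is easy to check empirically: a perceptron trained on the first two features to separate Setosa from the rest should reach perfect training accuracy, since the perceptron convergence theorem guarantees convergence on linearly separable data. The cell below is an illustrative sketch using scikit-learn's Perceptron and is not part of the experiment pipeline.
In [ ]:
# Illustrative aside: check that Iris Setosa is linearly separable from the
# other two classes using only the first two features.
from sklearn.linear_model import Perceptron
iris_check = datasets.load_iris()
X_check = iris_check.data[:, :2]                 # sepal length, sepal width
y_check = (iris_check.target == 0).astype(int)   # Setosa vs. rest
clf = Perceptron(max_iter=1000, tol=None)        # no early stopping
clf.fit(X_check, y_check)
print('Perceptron training accuracy:', clf.score(X_check, y_check))  # expect 1.0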
In [3]:
# This code was adapted from
# http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#
iris = datasets.load_iris()
X = iris.data[:, :2] # Take the first two features.
Y = iris.target
In [4]:
print('X has shape', X.shape)
print('Y has shape', Y.shape)
In [5]:
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(2, figsize=(12, 7))
plt.clf()
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('The Iris flower data set')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.savefig('iris.png', dpi=600, bbox_inches='tight', transparent=True)
plt.show()
In [6]:
bias = np.ones((X.shape[0], 1)) # Add a bias variable set to 1.
X = np.hstack((X, bias))
Y[Y == 1] = -1  # Iris Versicolour
Y[Y == 2] = -1  # Iris Virginica
Y[Y == 0] = +1  # Iris Setosa (relabelled last so the new +1s are not overwritten)
np.random.seed(1)
size = X.shape[0]
index = np.arange(size)
np.random.shuffle(index)
training_index = index[:size // 2]
testing_index = index[size // 2:]
X_iris_training = X[training_index]
Y_iris_training = Y[training_index]
X_iris_testing = X[testing_index]
Y_iris_testing = Y[testing_index]
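As a quick sanity check on the split (illustrative only), we can confirm the two halves have the expected sizes and that both contain positive (Setosa) examples.
In [ ]:
# Sanity check: split sizes and class balance.
print('Training size:', X_iris_training.shape[0],
      '| positives:', np.sum(Y_iris_training == 1))
print('Testing size: ', X_iris_testing.shape[0],
      '| positives:', np.sum(Y_iris_testing == 1))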
In [7]:
n = 10
iterations = 75
X_testing = X_iris_testing
Y_testing = Y_iris_testing
X_training = X_iris_training
Y_training = Y_iris_training
We first run plain logistic regression as a passive baseline; see Section 7.5 of the report. The logistic_regression implementation works with labels in {0, 1}, so the -1 labels are mapped to 0 below and restored afterwards.
In [8]:
Y_training[Y_training == -1] = 0
Y_testing[Y_testing == -1] = 0
In [9]:
Y_training
Out[9]:
In [10]:
Y_testing
Out[10]:
In [11]:
average_accuracies_logr = \
logr.experiment(n, iterations, X_testing, Y_testing, X_training, Y_training)
In [13]:
print(average_accuracies_logr)
In [14]:
w_best = logr.train(X_training, Y_training)
print('w_best is', w_best)
predictions = logr.predict(w_best, X_testing)
print('Using w_best the accuracy is',
      logr.compute_accuracy(predictions, Y_testing))
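The logistic_regression module is project code and its internals are not shown in this notebook. As a point of reference, here is a minimal sketch of a trainer with the same interface, assuming plain batch gradient descent on the negative log-likelihood with {0, 1} labels. The learning rate, iteration count, and zero initialization are illustrative assumptions, not the module's actual settings.
In [ ]:
# Hypothetical sketch of logistic regression with the same interface as
# logr.train / logr.predict / logr.compute_accuracy. All hyperparameters
# here are assumptions for illustration.
def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clip avoids overflow

def train_sketch(X, Y, lr=0.1, n_iter=1000):
    """Batch gradient descent on the logistic negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid_sketch(X @ w) - Y)  # gradient of the NLL
        w -= lr * grad / X.shape[0]
    return w

def predict_sketch(w, X):
    return (sigmoid_sketch(X @ w) >= 0.5).astype(int)

def compute_accuracy_sketch(predictions, Y):
    return np.mean(predictions == Y)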
In [15]:
Y_training[Y_training == 0] = -1  # Restore the -1 labels for the active learning runs.
Y_testing[Y_testing == 0] = -1
In [16]:
Y_training
Out[16]:
In [17]:
Y_testing
Out[17]:
In [18]:
average_accuracies_ac = \
experiment.experiment(n, iterations, X_testing, Y_testing,
X_training, Y_training, center='ac',
sample=1, M=None)
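The 'AC' legend in the plot below suggests center='ac' selects the analytic center of the version space; whether active.py computes it this way is an assumption. For a polytope {w : Aw <= b}, the analytic center minimizes the log-barrier -sum_i log(b_i - a_i.w), which a damped Newton iteration handles directly, as in the hypothetical sketch below.
In [ ]:
# Hypothetical sketch: analytic center of {w : A w <= b} via damped Newton
# on the log-barrier. Requires a strictly feasible starting point w0.
def analytic_center_sketch(A, b, w0, n_steps=50):
    w = w0.copy()
    for _ in range(n_steps):
        s = b - A @ w                        # slacks; must stay positive
        grad = A.T @ (1.0 / s)               # gradient of -sum(log(s))
        H = A.T @ np.diag(1.0 / s**2) @ A    # Hessian of the barrier
        dw = np.linalg.solve(H, -grad)       # Newton direction
        t = 1.0
        while np.any(b - A @ (w + t * dw) <= 0):
            t *= 0.5                         # backtrack to stay feasible
        w = w + t * dw
    return w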
In [19]:
average_accuracies_cc = \
experiment.experiment(n, iterations, X_testing, Y_testing,
X_training, Y_training, center='cc',
sample=1, M=None)
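Similarly, the 'CC' legend suggests center='cc' is the Chebyshev center, the center of the largest ball inscribed in the version space polytope. It is the solution of a single linear program: maximize r subject to a_i.w + r*||a_i|| <= b_i and r >= 0. A hypothetical sketch using scipy (not imported elsewhere in this notebook) follows.
In [ ]:
# Hypothetical sketch: Chebyshev center of {w : A w <= b} via one LP.
from scipy.optimize import linprog

def chebyshev_center_sketch(A, b):
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    A_ub = np.hstack((A, norms))        # rows [a_i, ||a_i||] act on [w, r]
    c = np.zeros(A.shape[1] + 1)
    c[-1] = -1.0                        # linprog minimizes, so minimize -r
    bounds = [(None, None)] * A.shape[1] + [(0, None)]  # w free, r >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b, bounds=bounds)
    return res.x[:-1], res.x[-1]        # center w and radius r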
In [20]:
average_accuracies_rand = \
experiment.experiment(n, iterations, X_testing, Y_testing,
X_training, Y_training, center='random',
sample=1, M=None)
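For context, all three runs share the same cutting-plane active learning loop (see the report): each labelled pair (x, y) cuts the version space with the constraint y*(w.x) >= 0, a center of the remaining region serves as the current hypothesis, and the next query is the unlabelled point whose hyperplane passes closest to that center. The sketch below is hypothetical; the actual loop lives in active.py and experiment.py.
In [ ]:
# Hypothetical sketch of the shared cutting-plane active learning loop.
# compute_center stands in for any of the center rules above, e.g.
# lambda A, b: chebyshev_center_sketch(A, b)[0].
def active_learning_sketch(X_pool, Y_pool, iterations, compute_center):
    d = X_pool.shape[1]
    labelled = [0]                              # arbitrary seed query
    for _ in range(iterations):
        # Each label adds the cut y*(w.x) >= 0, i.e. a row a = -y*x, b = 0;
        # the box -1 <= w_j <= 1 keeps the center of the cone well defined.
        cuts = -Y_pool[labelled][:, None] * X_pool[labelled]
        A = np.vstack((cuts, np.eye(d), -np.eye(d)))
        b = np.concatenate((np.zeros(len(labelled)), np.ones(2 * d)))
        w = compute_center(A, b)                # current hypothesis
        scores = np.abs(X_pool @ w)             # distance-to-hyperplane proxy
        scores[labelled] = np.inf               # never re-query a point
        labelled.append(int(np.argmin(scores)))
    return w, labelled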
In [23]:
plt.figure(figsize=(12,7))
queries = np.arange(1, iterations + 1)
plt.plot(queries, average_accuracies_logr, 'mx-', label='LR',
         markevery=5, lw=1.5, ms=10, markerfacecolor='none',
         markeredgewidth=1.5, markeredgecolor='m')
plt.plot(queries, average_accuracies_ac, 'r^-', label='AC',
         markevery=5, lw=1.5, ms=10, markerfacecolor='none',
         markeredgewidth=1.5, markeredgecolor='r')
plt.plot(queries, average_accuracies_cc, 'go-', label='CC',
         markevery=5, lw=1.5, ms=10, markerfacecolor='none',
         markeredgewidth=1.5, markeredgecolor='g')
plt.plot(queries, average_accuracies_rand, 'bs-', label='Random',
         markevery=5, lw=1.5, ms=10, markerfacecolor='none',
         markeredgewidth=1.5, markeredgecolor='b')
plt.xlabel('Number of iterations')
plt.ylabel('Accuracy averaged over %d tests' % n)
plt.title('Average accuracy of a cutting plane active learning procedure (Iris flower data set)')
plt.legend(loc='best')
plt.savefig('iris_experiment.png', dpi=600, bbox_inches='tight', transparent=True)
plt.show()