Comparison of the accuracy of a cutting plane active learning procedure using the (i) analytic center; (ii) Chebyshev center; and (iii) random center on the diabetes data set

The set up


In [1]:
import numpy as np
import pandas as pd
import active
import experiment
import logistic_regression as logr
from sklearn import datasets # The Iris dataset is imported from here.
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 1
%aimport active
%aimport experiment
%aimport logistic_regression

np.set_printoptions(precision=4)

In [2]:
plt.rcParams['axes.labelsize'] = 15
plt.rcParams['axes.titlesize'] = 15
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15
plt.rcParams['legend.fontsize'] = 15
plt.rcParams['figure.titlesize'] = 18

Importing and processing the diabetes data set

In this experiment we work with a data set with 2 classes and 8 features where the classes are divided between whether or not a patient has diabetes and the 8 features correspond to 8 health measurements.

The 2 classes in this data set are not linearly seperable and it is known that the data set has missing variables.

We work with all features of the data set and randomly divide the data set into two halves, to be used for training and testing.


In [3]:
names = ['diabetes', 'num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age']
data = pd.read_csv('diabetes_scale.csv', header=None, names=names)
data['ones'] = np.ones((data.shape[0], 1)) # Add a column of ones
data.head()


Out[3]:
diabetes num preg plasma bp skin fold insulin bmi pedigree age ones
0 -1 -0.294118 0.487437 0.180328 -0.292929 -1.000000 0.001490 -0.531170 -0.033333 1.0
1 1 -0.882353 -0.145729 0.081967 -0.414141 -1.000000 -0.207153 -0.766866 -0.666667 1.0
2 -1 -0.058824 0.839196 0.049180 -1.000000 -1.000000 -0.305514 -0.492741 -0.633333 1.0
3 1 -0.882353 -0.105528 0.081967 -0.535354 -0.777778 -0.162444 -0.923997 -1.000000 1.0
4 -1 -1.000000 0.376884 -0.344262 -0.292929 -0.602837 0.284650 0.887276 -0.600000 1.0

In [4]:
np.random.seed(1)
size = data.shape[0]
index = np.arange(size)
np.random.shuffle(index)
training_index = index[:int(size/2)]
testing_index = index[int(size/2):]

Experimental procedure

See Section 7.5 of the report.

Logistic regression


In [5]:
Y = data['diabetes']
X = data[['num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age', 'ones']]
X = np.array(X)
Y = np.array(Y)
Y[Y==-1] = 0

X_diabetes_training = X[training_index]
Y_diabetes_training = Y[training_index]
X_diabetes_testing = X[testing_index]
Y_diabetes_testing = Y[testing_index]

In [6]:
print(X_diabetes_testing)


[[-0.1765  0.2462  0.1475 ..., -0.9291 -0.4667  1.    ]
 [-0.7647  0.0854 -0.1475 ..., -0.795  -0.9667  1.    ]
 [ 0.0588  0.1256  0.3443 ...,  0.0282 -0.0333  1.    ]
 ..., 
 [ 0.5294  0.2663  0.4754 ..., -0.5687 -0.3     1.    ]
 [-0.5294  0.7186  0.1803 ..., -0.6576 -0.8333  1.    ]
 [ 0.0588  0.0251  0.2459 ..., -0.4987 -0.1667  1.    ]]

In [7]:
print(Y_diabetes_testing)


[1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0 1
 0 1 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0
 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 1 1
 1 1 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1
 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1
 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0
 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1
 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0
 1 1 1 0 1 0 1 0 1 1 1 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 0
 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0
 0 1 1 1 1 0 1 0 1 1 0 0 0 0]

In [8]:
print(X_diabetes_training)


[[-0.1765  0.3668  0.2131 ..., -0.5141  0.      1.    ]
 [-0.8824  0.5176 -0.0164 ..., -0.9137 -0.9667  1.    ]
 [-0.2941  0.0955 -0.0164 ..., -0.8907 -0.8     1.    ]
 ..., 
 [ 0.0588  0.2362  0.1475 ..., -0.7472 -0.3667  1.    ]
 [-1.      0.3869 -0.0164 ..., -0.6106 -1.      1.    ]
 [-0.4118  0.2261  0.4098 ..., -0.819  -0.6     1.    ]]

In [9]:
print(Y_diabetes_training)


[1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 1
 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0 1 1 1
 0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 0 1 0
 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 1
 0 1 1 0 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 1 1 0 1 1 0 1
 1 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 0 1 1
 1 0 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 1 0
 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 0 1 0 0 1
 1 0 1 0 1 0 1 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0
 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1
 1 0 1 1 1 0 0 1 0 1 1 1 0 1]

Using logistic regression as a benchmark


In [66]:
n = 10
iterations = 15
X_training = X_diabetes_training
Y_training = Y_diabetes_training
X_testing = X_diabetes_testing
Y_testing = Y_diabetes_testing

Here we compute the average accuracy over 10 tests, where for each test the accuracy is computed by training logistic regression on 1 to 15 randomly selected patterns from the same fixed training set, which comprises a uniform sample of half the data set.

We note here that the optimization process sometimes fails to converge.


In [53]:
average_accuracies_logr_15 = \
logr.experiment(n, iterations, X_testing, Y_testing, X_training, Y_training)

In [54]:
print(average_accuracies_logr_15)


[ 0.5     0.593   0.6357  0.6594  0.6878  0.6786  0.6596  0.6922  0.6758
  0.6815  0.6711  0.668   0.681   0.6807  0.6898]

Here we compute the average accuracy over 10 tests, where for each test the accuracy is computed by training logistic regression on 1 to 30 randomly selected patterns from the same fixed training set, which comprises a uniform sample of half the data set.

We compute the accuracy from training logistic regression on 50% and 100% of the training data set also.

We note here that the optimization process sometimes fails to converge.


In [55]:
iterations = 30

In [56]:
average_accuracies_logr_30 = \
logr.experiment(n, iterations, X_testing, Y_testing, X_training, Y_training)

In [57]:
print(average_accuracies_logr_30)


[ 0.5     0.593   0.6357  0.6594  0.6878  0.6786  0.6596  0.6922  0.6758
  0.6815  0.6711  0.668   0.681   0.6807  0.6898  0.6893  0.681   0.6826
  0.6927  0.694   0.6982  0.7049  0.7062  0.7068  0.7096  0.7148  0.712
  0.7112  0.7125  0.712 ]

Here we train logistic regression on 50% of the training data set, selected random uniformly, and repeat this 10 times to compute the average accuracy.


In [58]:
n = 10

In [59]:
size = X_training.shape[0]
size_half = int(size/2)
accuracies = []
for i in range(n):
    index_all = np.arange(size)
    np.random.shuffle(index_all)
    index_half = index_all[:size_half]
    X_training_half = X_training[index_half]
    Y_training_half = Y_training[index_half]
    w_half = logr.train(X_training_half, Y_training_half)
    predictions = logr.predict(w_half, X_training_half)
    accuracy = logr.compute_accuracy(predictions, Y_training_half)
    accuracies.append(accuracy)
accuracies = np.array(accuracies)
average_accuracy_training_half = np.sum(accuracies)/n

In [60]:
print('The average accuracy training on 50% of the training data set is',\
      average_accuracy)


The average accuracy training on 50% of the training data set is 0.777083333333

Here we simply train logistic regression on half of the data and test it on the other half.


In [67]:
w_training = logr.train(X_training, Y_training)
predictions = logr.predict(w_training, X_testing)
accuracy_training_all = logr.compute_accuracy(predictions, Y_testing)

In [68]:
print("The accuracy training on the whole training data set is", accuracy)


The accuracy training on the whole training data set is 0.7604166666666666

Average accuracy of the cutting plane active learning procedure over 10 tests and 15 iterations using the (i) analytic center; (ii) Chebyshev center; and (iii) random center


In [72]:
Y = data['diabetes']
X = data[['num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age', 'ones']]
X = np.array(X)
Y = np.array(Y)

X_diabetes_training = X[training_index]
Y_diabetes_training = Y[training_index]
X_diabetes_testing = X[testing_index]
Y_diabetes_testing = Y[testing_index]

In [73]:
n = 10
iterations = 15
X_testing = X_diabetes_testing
Y_testing = Y_diabetes_testing
X_training = X_diabetes_training
Y_training = Y_diabetes_training

In [74]:
average_accuracies_ac_15 = \
experiment.experiment(n, iterations, X_testing, Y_testing, 
                      X_training, Y_training, center='ac', 
                      sample=1, M=None)


**** Starting or restarting experiment ****
10 random vectors have been generated so far
100 random vectors have been generated so far
1000 random vectors have been generated so far
10000 random vectors have been generated so far
100000 random vectors have been generated so far
1000000 random vectors have been generated so far

In [75]:
average_accuracies_cc_15 = \
experiment.experiment(n, iterations, X_testing, Y_testing, 
                      X_training, Y_training, center='cc', 
                      sample=1, M=None)


**** Starting or restarting experiment ****
10 random vectors have been generated so far
100 random vectors have been generated so far
1000 random vectors have been generated so far
10000 random vectors have been generated so far
100000 random vectors have been generated so far
1000000 random vectors have been generated so far

In [76]:
average_accuracies_rand_15 = \
experiment.experiment(n, iterations, X_testing, Y_testing, 
                      X_training, Y_training, center='random', 
                      sample=1, M=None)


**** Starting or restarting experiment ****
10 random vectors have been generated so far
100 random vectors have been generated so far
1000 random vectors have been generated so far
10000 random vectors have been generated so far
100000 random vectors have been generated so far
1000000 random vectors have been generated so far

In [82]:
plt.figure(figsize=(12,7))

queries = np.arange(1, iterations + 1)
plt.plot(queries, average_accuracies_logr_15, 'mx-', label='LR', 
         markevery=5,
         lw=1.5, ms=10, markerfacecolor='none', markeredgewidth=1.5,
         markeredgecolor = 'm')

plt.plot(queries, average_accuracies_ac_15, 'r^-', label='AC', 
         markevery=5,
         lw=1.5, ms=10, markerfacecolor='none', markeredgewidth=1.5,
         markeredgecolor = 'r')

plt.plot(queries, average_accuracies_cc_15, 'go-', label='CC', 
         markevery=5,
         lw=1.5, ms=10, markerfacecolor='none', markeredgewidth=1.5,
         markeredgecolor = 'g')

plt.plot(queries, average_accuracies_rand_15, 'bs-', label='Random', 
         markevery=5,
         lw=1.5, ms=10, markerfacecolor='none', markeredgewidth=1.5,
         markeredgecolor = 'b')

plt.plot(queries, [average_accuracy_training_half]*queries.shape[0], 'k--', 
         color = '0.4', label='LR - half',lw=1.5, ms=10)

plt.plot(queries, [accuracy_training_all]*queries.shape[0], 'k-', 
         color = '0.4', label='LR - all',lw=1.5, ms=10)

plt.xlabel('Number of iterations')
plt.ylabel('Accuracy averaged over %d tests' % n)
plt.title('Average accuracy of a cutting plane active learning procedure (diabetes data set)')
plt.legend(loc='best')

plt.savefig('diabetes_experiment_all_15.png', dpi=600, bbox_inches='tight', transparent=True)
plt.show()