Classification

COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial 3

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up the environment

We use the SciPy implementation of the logistic sigmoid function, rather than (naively) implementing it ourselves, to avoid numerical overflow and underflow issues.


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as opt
from scipy.special import expit # The logistic sigmoid function 

%matplotlib inline
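
As a quick illustration of why this matters (a sketch, not part of the tutorial solution), a naive sigmoid overflows for inputs of large magnitude, whereas expit handles them gracefully:


In [ ]:
def naive_sigmoid(x):
    """A direct implementation of the logistic sigmoid (prone to overflow)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-1000.0, 0.0, 1000.0])
print(naive_sigmoid(z)) # Emits an overflow RuntimeWarning for the -1000 entry
print(expit(z))         # Numerically stable: gives [0., 0.5, 1.] without warnings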

The data set

We will predict the incidence of diabetes based on various measurements (see description). Instead of directly using the raw data, we use a normalised version, where the label to be predicted (the incidence of diabetes) is in the first column. Download the data from mldata.org.

Read in the data using pandas.


In [2]:
names = ['diabetes', 'num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age']
data = pd.read_csv('diabetes_scale.csv', header=None, names=names)
data['diabetes'].replace(-1, 0, inplace=True) # The target variable needs to be 0 or 1, not -1 or +1
data.head()


Out[2]:
diabetes num preg plasma bp skin fold insulin bmi pedigree age
0 0 -0.294118 0.487437 0.180328 -0.292929 -1.000000 0.001490 -0.531170 -0.033333
1 1 -0.882353 -0.145729 0.081967 -0.414141 -1.000000 -0.207153 -0.766866 -0.666667
2 0 -0.058824 0.839196 0.049180 -1.000000 -1.000000 -0.305514 -0.492741 -0.633333
3 1 -0.882353 -0.105528 0.081967 -0.535354 -0.777778 -0.162444 -0.923997 -1.000000
4 0 -1.000000 0.376884 -0.344262 -0.292929 -0.602837 0.284650 0.887276 -0.600000
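
Before modelling, it is worth checking how the two classes are distributed, since the class balance affects how we interpret accuracy versus balanced accuracy later on. A minimal check (a sketch; the counts are not reproduced here):


In [ ]:
data['diabetes'].value_counts() # Number of examples in each class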

Classification via Logistic Regression

Implement binary classification using logistic regression for a data set with two classes. Make sure you use appropriate Python style and docstrings.

Use scipy.optimize.fmin_bfgs to optimise your cost function. fmin_bfgs requires the cost function to be minimised and the gradient of that cost function. Implement these two functions as cost and grad by following the equations in the lectures.

Implement the function train that takes a matrix of examples and a vector of labels, and returns the maximum likelihood weight vector for logistic regression. Also implement a function test that takes this maximum likelihood weight vector and a matrix of examples, and returns the predictions. See the section Putting everything together below for expected usage.

We add an extra column of ones to represent the constant basis.


In [3]:
data['ones'] = np.ones(data.shape[0]) # Add a column of ones for the constant basis
data.head()
data.shape


Out[3]:
(768, 10)

The Set-up

We have 9 input variables $x_0, \dots, x_8$, where $x_0$ is a dummy input variable fixed at 1. (The index of the dummy variable is unimportant; it could just as easily be $x_5$ or $x_8$.) We set the basis functions to the simplest choice, $\phi_0(\mathbf{x}) = x_0, \dots, \phi_8(\mathbf{x}) = x_8$. Our model then has the form $$ y(\mathbf{x}) = \sigma\left(\sum_{j=0}^{8} w_j x_j\right) = \sigma(\mathbf{w}^T \mathbf{x}). $$ We have a dataset $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$ with $t_n \in \{0, 1\}$ and $N=768$ examples. We train the model by finding the parameter vector $\mathbf{w}$ which minimizes the (data-dependent) cross-entropy error function $$ E_D(\mathbf{w}) = - \sum_{n=1}^{N} \{t_n \ln \sigma(\mathbf{w}^T \mathbf{x}_n) + (1 - t_n)\ln(1 - \sigma(\mathbf{w}^T \mathbf{x}_n))\}. $$ The gradient of this function is $$ \nabla E_D(\mathbf{w}) = \sum_{n=1}^{N} (\sigma(\mathbf{w}^T \mathbf{x}_n) - t_n)\mathbf{x}_n. $$


In [4]:
def cost(w, X, y, c=0):
    """
    Returns the cross-entropy error function with (optional) sum-of-squares regularization term.
    
    w -- parameters
    X -- dataset of features where each row corresponds to a single sample
    y -- dataset of labels where each row corresponds to a single sample
    c -- regularization coefficient (default = 0)
    """
    outputs = expit(X.dot(w)) # Vector of outputs (or predictions)
    return -( y.transpose().dot(np.log(outputs)) + (1-y).transpose().dot(np.log(1-outputs)) ) + c*0.5*w.dot(w)

def grad(w, X, y, c=0):
    """
    Returns the gradient of the cross-entropy error function with (optional) sum-of-squares regularization term.
    Parameters are the same as for cost.
    """
    outputs = expit(X.dot(w))
    return X.transpose().dot(outputs-y) + c*w
    
def train(X, y, c=0):
    """
    Returns the vector of parameters which minimizes the error function via the BFGS algorithm.
    """
    initial_values = np.zeros(X.shape[1]) # Errors can occur if initial_values is set too high
    return opt.fmin_bfgs(cost, initial_values, fprime=grad, args=(X, y, c))

def predict(w, X):
    """
    Returns the vector of predicted class-1 probabilities; threshold at 0.5 to obtain class labels.
    """
    return expit(X.dot(w))
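
As a quick sanity check (a sketch, not required by the exercise), we can compare grad against a finite-difference approximation of cost using scipy.optimize.check_grad on a small synthetic problem; a value close to zero indicates the two functions are consistent:


In [ ]:
rng = np.random.RandomState(0)
X_check = np.hstack([rng.randn(20, 3), np.ones((20, 1))]) # 20 samples: 3 random features plus a constant column
y_check = rng.randint(0, 2, 20)                           # Random binary labels
w_check = 0.1 * rng.randn(4)                              # A small random parameter vector
print(opt.check_grad(cost, grad, w_check, X_check, y_check)) # Should be a very small number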

Performance measure

There are many ways to compute the performance of a binary classifier. The key concept is the idea of a confusion matrix or contingency table:

                        Label
                     +1      -1
Prediction    +1     TP      FP
              -1     FN      TN

where

  • TP - true positive
  • FP - false positive
  • FN - false negative
  • TN - true negative

Implement three functions: the first returns the confusion matrix for comparing two lists (one set of predictions and one set of labels); the other two take the confusion matrix as input and return the accuracy and balanced accuracy respectively. The balanced accuracy is the average accuracy of each class.


In [5]:
def confusion_matrix(predictions, y): 
    """
    Returns the confusion matrix [[tp, fp], [fn, tn]].
    
    predictions -- dataset of predictions (or outputs) from a model
    y -- dataset of labels where each row corresponds to a single sample
    """
    tp, fp, fn, tn = 0, 0, 0, 0
    predictions = predictions.round().values # Converts to numpy.ndarray
    y = y.values
    for prediction, label in zip(predictions, y):
        if prediction == label: 
            if prediction == 1:
                tp += 1
            else:
                tn += 1
        else: 
            if prediction == 1:
                fp += 1
            else: 
                fn += 1
    return np.array([[tp, fp], [fn, tn]])

def accuracy(cm):
    """
    Returns the accuracy, (tp + tn)/(tp + fp + fn + tn).  
    """
    return cm.trace()/cm.sum()

def positive_pred_value(cm):
    """
    Returns the positive predictive value, tp / (tp + fp).
    """
    return cm[0,0]/(cm[0,0] + cm[0,1])
    
def negative_pred_value(cm):
    """
    Returns the negative predictive value, tn / (tn + fn).
    """
    return cm[1,1]/(cm[1,0] + cm[1,1])
    
def balanced_accuracy(cm): 
    """
    Returns the balanced accuracy, (tp/(tp + fp) + tn/(tn + fn))/2.
    """
    return (cm[0,0]/(cm[0,0] + cm[0,1]) +  cm[1,1]/(cm[1,0] + cm[1,1]))/2
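
For reference, the same confusion matrix can be computed without an explicit Python loop; a vectorised sketch using boolean masks (not required by the exercise, and assuming 0/1 labels as above):


In [ ]:
def confusion_matrix_vectorised(predictions, y):
    """Returns the confusion matrix [[tp, fp], [fn, tn]] using boolean masks."""
    p = np.asarray(predictions).round() # Threshold the predicted probabilities at 0.5
    t = np.asarray(y)
    tp = int(np.sum((p == 1) & (t == 1)))
    fp = int(np.sum((p == 1) & (t == 0)))
    fn = int(np.sum((p == 0) & (t == 1)))
    tn = int(np.sum((p == 0) & (t == 0)))
    return np.array([[tp, fp], [fn, tn]])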

Putting everything together

Consider the following code, which trains on all the examples, and predicts on the training set. Discuss the results.


In [6]:
y = data['diabetes']
X = data[['num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age', 'ones']]
theta_best = train(X, y)
print(theta_best)
pred = predict(theta_best, X)
cmatrix = confusion_matrix(pred, y)
[accuracy(cmatrix), balanced_accuracy(cmatrix)]


Optimization terminated successfully.
         Current function value: 361.722693
         Iterations: 18
         Function evaluations: 30
         Gradient evaluations: 30
[-1.04704922 -3.49878957  0.81102853 -0.03063881  0.50408756 -3.00946746
 -1.10680533 -0.44607073  0.19501805]
Out[6]:
[0.78255208333333337, 0.76912964680456408]

To aid our discussion, we also report the positive predictive value (PPV) and the negative predictive value (NPV).


In [8]:
[positive_pred_value(cmatrix), negative_pred_value(cmatrix)]


Out[8]:
[0.79892280071813282, 0.73933649289099523]

Discussion

Overall, both the accuracy and the balanced accuracy of our model are reasonable, given our naive choice of basis functions. The gap between the two is explained by the PPV being higher than the NPV: accuracy weights these two values by the number of predictions in each class, whereas balanced accuracy (as computed here) weights them equally, so the larger positive-prediction group with its higher PPV pulls the plain accuracy above the balanced accuracy.

(optional) Effect of regularization parameter

Split the data into two halves; train on one half and report performance on the other half. By repeating this experiment for different values of the regularization parameter $\lambda$ we can get a feel for how the performance of the classifier varies with regularization. Plot the values of accuracy and balanced accuracy for at least 3 different choices of $\lambda$. Note that you may have to update your implementation of logistic regression to include the regularisation parameter.


In [9]:
def split_data(data):
    """
    Randomly split data into two equal groups.
    """
    np.random.seed(1)
    N = len(data)
    idx = np.arange(N)
    np.random.shuffle(idx)
    train_idx = idx[:int(N/2)]
    test_idx = idx[int(N/2):]

    X_train = data.loc[train_idx].drop('diabetes', axis=1)
    t_train = data.loc[train_idx]['diabetes']
    X_test = data.loc[test_idx].drop('diabetes', axis=1)
    t_test = data.loc[test_idx]['diabetes']
    
    return X_train, t_train, X_test, t_test

def reg_coefficient_comparison(reg_coefficients, X_train, t_train, X_test, t_test):
    """
    Returns the accuracy and balanced accuracy for the given regularization coefficient values.
    
    reg_coefficients -- list of regularization coefficient values
    X_train -- the input dataset used for training
    t_train -- the dataset of labels used for training
    X_test -- the input dataset used to make predictions from the trained model
    t_test -- dataset of labels for performance assessment 
    """
    summary = []
    for c in reg_coefficients:
        w_best = train(X_train, t_train, c)
        predictions = predict(w_best, X_test)
        cm = confusion_matrix(predictions, t_test)
        summary.append([c, accuracy(cm), balanced_accuracy(cm)])
    return pd.DataFrame(summary, columns=["regularization coefficient", "accuracy", "balanced accuracy"])

X_train, t_train, X_test, t_test = split_data(data)
reg_coefficients = [0, 0.01, 0.1, 0.25, 0.5, 1, 1.5, 1.75, 2, 5, 9, 10, 11, 20, 100, 150]
reg_coefficient_comparison(reg_coefficients, X_train, t_train, X_test, t_test)


Optimization terminated successfully.
         Current function value: 175.645824
         Iterations: 18
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 175.791523
         Iterations: 18
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 177.046702
         Iterations: 17
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 178.946282
         Iterations: 17
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 181.702441
         Iterations: 18
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 186.174289
         Iterations: 16
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 189.748564
         Iterations: 19
         Function evaluations: 29
         Gradient evaluations: 29
Optimization terminated successfully.
         Current function value: 191.302245
         Iterations: 16
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 192.733938
         Iterations: 16
         Function evaluations: 28
         Gradient evaluations: 28
Optimization terminated successfully.
         Current function value: 204.480698
         Iterations: 16
         Function evaluations: 27
         Gradient evaluations: 27
Optimization terminated successfully.
         Current function value: 213.232914
         Iterations: 15
         Function evaluations: 26
         Gradient evaluations: 26
Optimization terminated successfully.
         Current function value: 214.838433
         Iterations: 15
         Function evaluations: 26
         Gradient evaluations: 26
Optimization terminated successfully.
         Current function value: 216.290481
         Iterations: 15
         Function evaluations: 25
         Gradient evaluations: 25
Optimization terminated successfully.
         Current function value: 225.175215
         Iterations: 17
         Function evaluations: 26
         Gradient evaluations: 26
Optimization terminated successfully.
         Current function value: 244.154472
         Iterations: 18
         Function evaluations: 32
         Gradient evaluations: 32
Optimization terminated successfully.
         Current function value: 247.824279
         Iterations: 17
         Function evaluations: 32
         Gradient evaluations: 32
Out[9]:
regularization coefficient accuracy balanced accuracy
0 0.00 0.765625 0.741143
1 0.01 0.765625 0.741143
2 0.10 0.765625 0.741822
3 0.25 0.770833 0.750104
4 0.50 0.770833 0.750104
5 1.00 0.770833 0.751127
6 1.50 0.770833 0.751127
7 1.75 0.763021 0.742502
8 2.00 0.757812 0.736642
9 5.00 0.765625 0.752150
10 9.00 0.750000 0.747225
11 10.00 0.755208 0.758935
12 11.00 0.755208 0.762150
13 20.00 0.726562 0.748771
14 100.00 0.666667 0.707895
15 150.00 0.658854 0.330287
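
The exercise also asks for a plot. Here is a minimal sketch, assuming the functions defined above; it re-runs the comparison, stores the output in a DataFrame called results (a name introduced here), and plots both measures against the regularization coefficient on a symlog scale, since the coefficients span several orders of magnitude:


In [ ]:
results = reg_coefficient_comparison(reg_coefficients, X_train, t_train, X_test, t_test)
plt.plot(results['regularization coefficient'], results['accuracy'], label='accuracy')
plt.plot(results['regularization coefficient'], results['balanced accuracy'], label='balanced accuracy')
plt.xscale('symlog')  # Handles the zero coefficient as well as the large values
plt.xlabel('regularization coefficient')
plt.ylabel('performance')
plt.legend()
plt.show()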

Discussion

On this particular train/test split, accuracy is maximized for regularization coefficients between roughly 0.25 and 1.5, while balanced accuracy peaks at a coefficient of approximately 11. Both measures degrade noticeably once the coefficient becomes large (around 100 or more).

Discussion

Here we discuss possible approaches to improving our predictions. We chose the simplest basis functions, $\phi_0(\mathbf{x}) = x_0, \dots, \phi_8(\mathbf{x}) = x_8$. Using nonlinear basis functions, for instance polynomial basis functions, may improve our predictive ability (a sketch of this idea appears at the end of this section). This raises the question of how to choose appropriate basis functions, given that for data with more than 2 or 3 dimensions it is difficult to make such choices based on straightforward visualization. From the description of the dataset we also know that there was missing data:

"Until 02/28/2011 this web page indicated that there were no missing values in the dataset. As pointed out by a repository user, this cannot be true: there are zeros in places where they are biologically impossible, such as the blood pressure attribute. It seems very likely that zero values encode missing data. However, since the dataset donors made no such statement we encourage you to use your best judgement and state your assumptions."

It is likely that if our dataset were more complete, our model would have stronger predictive abilities.
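
As a sketch of the basis-function idea mentioned above (not part of the tutorial solution), one could augment the design matrix with squared features, a simple form of polynomial basis function, and reuse the same training and evaluation code. Whether this actually helps would need to be checked on held-out data; X_quad and the other names below are introduced here purely for illustration:


In [ ]:
feature_names = ['num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age']
X_quad = data[feature_names].copy()
for name in feature_names:
    X_quad[name + '^2'] = X_quad[name] ** 2 # Add a squared (quadratic) basis function per feature
X_quad['ones'] = 1.0                        # Constant basis
w_quad = train(X_quad, data['diabetes'])
cm_quad = confusion_matrix(predict(w_quad, X_quad), data['diabetes'])
print(accuracy(cm_quad), balanced_accuracy(cm_quad)) # Training-set performance only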