Example 5.4: Classifying Malignant/Benign Breast Tumors with Artificial Neural Networks

Welcome to the practical section of module 5.4. Here we'll be using Artificial Neural Networks with the Wisconsin Breast Cancer Database, just like in the practical example for module 5.2, to predict whether a patient's tumor is benign or malignant based on tumor cell characteristics. This is just one of many problems where machine learning and classification can offer great insight and aid. Make sure to delete any rows with missing data (these contain a "?" character in a feature cell); we'll also see how to do this cleanup in pandas shortly.

By the end of the module, we'll have trained an artificial neural network model on the features presented in the dataset that is very accurate at diagnosing the condition of the tumor based on these features. We'll also see how we can make interesting inferences from the model that could be helpful to physicians in diagnosing cancer.

Since scikit-learn's newest stable version does not support neural networks/multi-layer perceptrons, we will be using the third-party scikit-neuralnetwork implementation. To install scikit-neuralnetwork, please consult the Installation section of its documentation.
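For most setups this should boil down to a single pip command, something like pip install scikit-neuralnetwork; if that doesn't work for your environment, the documentation covers the alternatives.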

First, we will import all our dependencies. Make sure to install all of these separately:


In [52]:
import pandas as pd # we use this library to import a CSV of cancer tumor data
import numpy as np # we use this library to help us represent traditional Python arrays/lists as matrices/tensors with linear algebra operations

from sknn.mlp import Classifier, Layer # we use this library for the actual neural network code
from sklearn.utils import shuffle # we use this library for randomly shuffling arrays/tensors

import sys # we use this to direct the logging output stream
import logging # we use this library for outputting real-time statistics/updates on training progress
logging.basicConfig(format="%(message)s", level=logging.DEBUG, stream=sys.stdout) # set the logging level to DEBUG to output training information; use logging.INFO for less output

The Data

We'll start off by exploring our dataset to see what attributes we have and how the class of the tumor is represented.

Before we proceed, make sure to add headers to the dataset provided by the University of Wisconsin. We will use the following headers:

ID,CT,UCS,UCSh,MA,SECS,BN,BC,NN,M,Class

Add this as the first line of your .csv file.
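Alternatively, if you'd rather not edit the file by hand, pandas can inject the headers at load time via read_csv's names parameter. A minimal sketch, equivalent to the cell below (assuming the raw file has no header row of its own):


In [ ]:
headers = ["ID", "CT", "UCS", "UCSh", "MA", "SECS", "BN", "BC", "NN", "M", "Class"] # the same headers as above
dataset = pd.read_csv('./datasets/breast-cancer-wisconson.csv', names=headers) # names= tells pandas to use these as column labels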


In [2]:
dataset = pd.read_csv('./datasets/breast-cancer-wisconson.csv') # import the CSV data into a DataFrame using the pandas library
print dataset[:10]


Out[2]:
ID CT UCS UCSh MA SECS BN BC NN M Class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
5 1017122 8 10 10 8 7 10 9 7 1 4
6 1018099 1 1 1 1 2 10 3 1 1 2
7 1018561 2 1 2 1 2 1 3 1 1 2
8 1033078 2 1 1 1 2 1 1 1 5 2
9 1033078 4 2 1 1 2 1 2 1 1 2
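Also, recall from the introduction that rows with missing data (marked with a "?") must be removed. If you'd rather do this in pandas than edit the CSV by hand, a minimal sketch of the same cleanup:


In [ ]:
dataset = dataset.replace('?', np.nan) # turn the "?" placeholders into proper missing values
dataset = dataset.dropna() # drop every row that contains a missing value
dataset = dataset.astype(int) # the affected column was read in as strings, so convert everything back to integers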

To understand the meaning of the abbreviations, we can consult the dataset's website, which describes each attribute in order. Unlike the logistic regression example (where we trained on just three features), we are going to train on all of them. This means we will be unable to visualize the results, but we will get a feel for how to work with high-dimensional data.

If you look at the Class attribute at the end (which gives the class of the tumor), you'll find that it takes either 2 or 4, where 2 represents a benign tumor and 4 represents a malignant one. We'll change that to more expressive values and represent benign tumors by 0 (false) and malignant ones by 1 (true).

You'll also notice the ID attribute, which is useless to our modelling: it provides no information about the tumor itself and is merely a way of identifying a specific tumor. We will hence strip it from our dataset before training.


In [ ]:
dataset = dataset[["CT", "UCS", "UCSh", "MA", "SECS", "BN", "BC", "NN", "M", "Class"]]  # remove the ID attribute from the dataset
dataset.is_copy = False  # this is just to hide a nasty warning!
dataset["Class"] = [0 if tclass == 2 else 1 for tclass in dataset["Class"]]  # convert the Class column from 2/4 to 0/1

We will now need to split up the dataset into two separate tensors: X and y. X will contain the features and their values for each training example, and y will contain all the outputs. In this training set, let's say m refers to the number of training examples (of which there are just over 600) and n refers to the number of features (of which there are 9). Thus, X will be a matrix where $X\in \mathbb{R}^{m\times n}$ and y is a vector where $y\in \mathbb{R}^m$ (because we only have one output - a probability of the tumor being malignant).

We simply separate by the "Class" attribute and the other features.


In [ ]:
X = np.array(dataset[["CT", "UCS", "UCSh", "MA", "SECS", "BN", "BC", "NN", "M"]]) # X is composed of all the n feature columns
y = np.array(dataset["Class"]) # y is composed of just the output class column

Training the Neural Network Model

For the training, we are going to split the dataset into a training set and a test set. The training set will be 70% of the original data set, and will be what the neural network will learn from. We will test the accuracy of the neural network's learned weights by using the test set, which is composed of 30% of the original data.

It's very important to shuffle the dataset before partitioning it into training/test sets. Why? Because the data given to us by the University of Wisconsin may be in some sorted (apparent or unapparent) order. It may be that most y=0 examples are in the latter half of the data, or that nearby records belong to similar patients or were recorded at similar dates/times. We want none of that: we want to remove any such ordering and ensure that our partitioning is random, so that our test results reflect the true probability of error on a randomly drawn case. For example, I once got a 5% difference in test results because the data I had was sorted beforehand. Essentially, we want to make sure that we're being absolutely fair about the partition, and not accidentally making our test results too good or too bad.

We can't use numpy's traditional shuffle function because it shuffles one array only. If we independently shuffled X and y, the correspondence between them would be lost (i.e. if a training case had an output of 1 beforehand, it might accidentally end up paired with an output of 0, since we use corresponding indices to match up the inputs and outputs).


In [ ]:
X, y = shuffle(X, y, random_state=0) # we use scikit-learn's synchronized shuffle feature to shuffle two arrays in unison

In [48]:
dataset_size = len(dataset) # get the size of overall dataset
training_size = np.floor(dataset_size * 0.7).astype(int) # get the training size as 70% of the dataset size (or roughly 0.7 * dataset_size) and as an integer

X_train = X[:training_size] # extract the first 70% of inputs for training
y_train = y[:training_size] # extract the first 70% of outputs for training

X_test = X[training_size:] # extract the remaining 30% of inputs for testing
y_test = y[training_size:] # extract the remaining 30% of outputs for testing

scikit-neuralnetwork offers a really neat, easy-to-use API (from sknn.mlp import Classifier, Layer) for training neural networks. This API supports many different paradigms like dropout, momentum, weight decay, mini-batch gradient descent etc., and even different neural network types like Convolutional Neural Networks. Today, however, our goal is to get a simple Artificial Neural Network set up!
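As a hedged aside, those extras are passed straight to the Classifier constructor as keyword arguments. The names below are from the scikit-neuralnetwork documentation as I remember it, so verify them against your installed version; none of this is used in the rest of the example:


In [ ]:
# a sketch only -- verify these keyword names against your installed version
nn_fancy = Classifier(
    layers=[Layer("Rectifier", units=100), Layer("Softmax", units=2)],
    learning_rule='momentum', # switch from plain SGD to momentum updates
    learning_momentum=0.9, # the momentum coefficient
    batch_size=25, # mini-batch gradient descent with 25 examples per batch
    dropout_rate=0.5, # randomly drop half the hidden units each update (a regularizer)
    n_iter=100) # train for 100 epochs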

Our first job is to configure the architecture for the neural network. We will need to decide:

  • The number of hidden layers
  • The size of each hidden layer
  • The activation function used at each hidden layer
  • The learning rate, number of iterations, and other hyperparameters

Some types of activation functions offered by scikit-neuralnetwork include:

  • Linear
  • Rectifier
  • Sigmoid
  • Softmax

Softmax computes a normalized probability distribution over multiple outputs. Generally, it is conventional to use Softmax as the activation function for the output layer (for a binary task it is effectively the same as a Sigmoid layer, but scikit-neuralnetwork will still output a warning). Recall the formula for the sigmoid activation function:

$Sigmoid\left(z\right)=\frac{1}{1+e^{-z}}$

This function "squeezes" any real value into a probability of range (0, 1).

The Linear and Rectifier (ReLU) activations may be used in the hidden layers, but we need Sigmoid/Softmax at the output because we are performing a classification task and want a probability out. Generally, I found that using Rectifier units throughout and then having a Softmax (Sigmoid) layer at the end produced the best results.

Now we need to decide on our structure (as in, the hidden layers and their sizes).

For this example, we will use two hidden layers (both of which are Rectifiers). Generally, the more hidden layers you have, the greater the complexity of the model; two should be fine for our task, though. Our neural network will have 9 input nodes (for our 9 features), and I have chosen 100 neurons for the first hidden layer and 50 for the second. The Softmax layer will just have one output, because we only want one output (the probability of malignancy). You can play around with these numbers, bar the Softmax output layer :) We don't really need that many hidden units in each hidden layer, but I want to demonstrate the scale we can create with this API.

NOTE: In actuality, our Softmax layer for this coding example will have two outputs. For classifiers, whether binary or multi-class, scikit-neuralnetwork uses a one-hot-encoded representation of the labels with a cross-entropy loss. This requires the output layer (the Softmax layer) to be a probability distribution over all labels, so the number of units in the Softmax layer needs to equal the number of labels. In binary classification we have 2 labels, 0 and 1, so the expected behavior is for the Softmax layer to have 2 units.
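To make that concrete, here is a tiny illustration (not part of the training code) of what a one-hot encoding of our labels looks like; scikit-neuralnetwork does something like this internally, so we never have to:


In [ ]:
labels = np.array([0, 1, 1, 0]) # four example labels: benign, malignant, malignant, benign
one_hot = np.eye(2)[labels] # 0 becomes [1, 0] and 1 becomes [0, 1]
print one_hot # the two columns correspond to the Softmax layer's two units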

Finally, we need to choose our hyperparameters. A learning rate of 0.001 and 100 iterations/epochs should suffice. Our code will look like the following:


In [ ]:
nn = Classifier(layers=[ # create a new Classifier object (neural network classifier), and pass all the layers as parameters
        Layer("Rectifier", units=100), # create the first post-input hidden layer, a Rectifier (ReLU) layer of 100 units
        Layer("Rectifier", units=50), # create the second hidden layer, a Rectifier layer of 50 units
        Layer("Softmax"), units=2], # create the final output layer, a Softmax layer that will output two probabilities, as mentioned before
        learning_rate=0.001, n_iter=100) # pass in hyperparameters as a separate parameter to the layers

Our neural network has been configured and built! Now, we just need to train it using our training set, using the intuitively named function "fit":


In [ ]:
nn.fit(X_train, y_train) # begin training using backpropagation!

With the DEBUG logging we configured, my console printed real-time statistics for each training epoch as it ran (it's fun to see training in action!).

NOTE: We do not need to prepend a column of 1s to the dataset beforehand to obtain the bias terms that shift the decision boundaries. The API provides biases by default.
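For comparison, if we did have to add the bias terms ourselves, it would amount to something like the following (purely illustrative; don't actually do this here, or the network would see 10 features instead of 9):


In [ ]:
X_train_biased = np.hstack((np.ones((len(X_train), 1)), X_train)) # prepend a column of 1s to every training example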

Results

We can see the weights that were produced (including the bias terms) using the following line of code:


In [ ]:
print nn.get_parameters() # output the weights of the neural network

Unlike logistic regression, the weights of neural networks, unfortunately, are not very interpretable. There is also a very large number of them, as you'll see if you run the line above.

However, it is more important to output the accuracy of these results, and how much error is made. Unlike logistic regression, where we used a cost function that outputted a real number, we are going to find the percentage error of our system like so:

  1. Create predictions in the form of probabilities for each test tumor being malignant
  2. Iterate through the test set with index i
  3. Fetch the predicted output probability of this test tumor
  4. Round this probability to either a 1 (malignant) or a 0 (benign)
  5. Compare this to the correct test output at the ith index
  6. If an error occurs, increment some pre-initialized error counter
  7. Get the percentage error by dividing the error counter by the total number of test examples and multiplying by 100

In [ ]:
error_count = 0.0 # initialize the error counter
prob_predictions = nn.predict_proba(X_test) # predict the outputs of the X_test instances, in the form of probabilities (hence predict_proba)
for i in xrange(len(prob_predictions)): # iterate through all the predictions (equal in size to test set)
    # create a discrete decision for the tumor being malignant or benign, using 0.5 as the probability threshold for a predicted malignancy (i.e. rounding)
    # as discussed before, our network actually outputs [probability_of_benign, probability_of_malignant], so we
    # fetch the probability_of_malignant value and round that one
    discrete_prediction = 0 if prob_predictions[i][1] < 0.5 else 1
    if not y_test[i] == discrete_prediction: # if the actual, correct value for this test tumor does not equal the discrete prediction
        error_count += 1.0 # increment the number of errors
        
error_rate = error_count / len(prob_predictions) * 100 # get the percentage of errors by dividing total errors by number of instances, multiplying by 100 
print error_count # print number of raw errors
print str(error_rate) + "%" # output this error percentage

My program produced 8 errors in total, with an error rate of 3.9%. This means my neural network had a success/accuracy rate of 96.1%, which is pretty good! In the future, we could make it even better: we could introduce more complex models with a greater number of hidden layers, or we could pre-process the input, e.g. with normalization. We may also want to apply regularization and ensure that our model is not too complex for the data (more test data would help us verify that, but right now it looks good!). Lastly, we may want to look at each individual error and try to minimize false negatives (predicting that a patient does not have a malignant tumor when they do: risky business!) over false positives.
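As a taste of one such improvement, min-max normalization of the inputs would look something like this (a minimal sketch; it would need to run before the train/test split, and since every feature here already sits on a 1-10 scale the gain may be modest):


In [ ]:
X_float = X.astype(float) # cast to floats so the division below isn't integer division
X_normalized = (X_float - X_float.min(axis=0)) / (X_float.max(axis=0) - X_float.min(axis=0)) # rescale every feature into [0, 1]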

There are many other things we can do from here, and this practical example demonstrated the power of neural networks and how we can use an API like scikit-neuralnetwork to get one up and running in no time. Hope y'all had fun!