Artificial Neural Networks

Artificial neural networks (ANNs) are computational models, inspired by the way the brain works, that are capable of machine learning and pattern recognition. The main advantage of ANNs is that they can learn complex non-linear hypotheses. They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network.

A single neuron can be modeled as a "logistic unit," in which a weighted sum of the input features is passed through a sigmoid function. Neurons can also be modeled using other activation functions. The perceptron, for example, is a well-known type of artificial neuron that uses a threshold activation function, so its output "hypothesis" is either 0 or 1. A logistic unit instead uses the sigmoid (logistic) activation function, so its output is a value between 0 and 1.
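
To make this concrete, the following minimal sketch computes the output of a single logistic unit. The feature values, weights, and bias term here are hypothetical and are not part of the implementation presented later.

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) activation function.
    return 1.0 / (1 + np.exp(-z))

# Hypothetical feature values and their associated weights.
features = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.25  # weight on the constant bias input

# The unit computes the weighted sum of its inputs and passes it through the sigmoid.
z = np.dot(weights, features) + bias
hypothesis = sigmoid(z)  # a value between 0 and 1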

A neural network is a group of neurons (here, logistic units) connected in layers. Traditionally, the first layer is called the "input layer" and the last layer the "output layer," while any layers in between are called "hidden layers." Each logistic unit in a layer has the same inputs, which are all the outputs from the previous layer, plus a bias input. The network has one neuron in the output layer for each possible class. The more hidden layers there are, the more complex (i.e. non-linear) relationships the network can learn.

Each neuron computes a weighted sum of its inputs, which is then passed through its activation function to produce an output. Initially, before training, the weights are set to random values. Then, an initial prediction from the neural network is computed by "forward propagation," where the outputs of neurons at earlier layers are used as the inputs to neurons at later ones. Since the weights are initially random, the network will compute an output that very likely differs from the actual output. A common method for measuring the discrepancy between the expected output and the actual output is the squared error measure (the square of the difference). Given this measure of error, the problem of mapping inputs to outputs can be reduced to an optimization problem: finding the combination of weights that produces the minimal error.

As the output of a neuron depends on the weighted sum of all its inputs, the error also depends on the incoming weights to the neuron. Thus, the error is minimized by the "backpropagation" algorithm, which propagates the error backwards through the network layers. As the error is propagated backwards, small changes are made to the weights in each layer that are calculated to reduce the error. The whole process proceeds iteratively over each instance in the dataset, and can be repeated until the overall error drops below some pre-determined threshold.
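
The sketch below illustrates one forward pass and one backpropagation-style weight update on a tiny, hypothetical network (two inputs, two hidden units with sigmoid activations, one output) using the squared error measure described above. The layer sizes, learning rate, and variable names are illustrative only; the implementation that follows instead minimizes a regularized logistic cost with an off-the-shelf optimizer.

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

rng = np.random.RandomState(0)

# One training instance with two features and a target output of 1.
x = np.array([0.7, -0.3])
y = 1.0

# Randomly initialized weights; the first column of each matrix holds the bias weights.
W1 = rng.uniform(-0.12, 0.12, size=(2, 3))  # input (2 features + bias) -> 2 hidden units
W2 = rng.uniform(-0.12, 0.12, size=(1, 3))  # hidden (2 units + bias) -> 1 output unit

# Forward propagation: the outputs of one layer become the inputs to the next.
a1 = np.concatenate(([1.0], x))             # input layer activations, with bias unit
z2 = W1.dot(a1)                             # weighted sums entering the hidden layer
a2 = np.concatenate(([1.0], sigmoid(z2)))   # hidden layer activations, with bias unit
z3 = W2.dot(a2)                             # weighted sum entering the output layer
a3 = sigmoid(z3)[0]                         # network output (the prediction)

# Squared error between the prediction and the target.
error = 0.5 * (a3 - y) ** 2

# Backpropagation: propagate the error backwards to obtain the weight gradients.
d3 = (a3 - y) * a3 * (1 - a3)                          # error signal at the output unit
d2 = W2[0, 1:] * d3 * sigmoid(z2) * (1 - sigmoid(z2))  # error signals at the hidden units
grad_W2 = d3 * a2                                      # gradient for the hidden-to-output weights
grad_W1 = np.outer(d2, a1)                             # gradient for the input-to-hidden weights

# Make small changes to the weights in the direction that reduces the error.
learning_rate = 0.5
W2 = W2 - learning_rate * grad_W2
W1 = W1 - learning_rate * grad_W1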

Here, we generate an ANN with one hidden layer.


In [1]:
# (c) 2014 Reid Johnson
#
# Modified from:
# Daniel Rodriguez (http://danielfrg.com/blog/2013/07/03/basic-neural-network-python/)
#
# Functions to generate an artificial neural network (ANN) with one hidden 
# layer. The ANN is a collection of artificial neurons arranged in layers 
# that are trained using backpropagation.
#
# The perceptron algorithm is defined by Rosenblatt in:
# The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.
# Psychological Review 65 (6): 386–408. 1958.
#
# The backpropagation algorithm is defined by Werbos in:
# Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. 1975.

from __future__ import division

import numpy as np
from scipy import optimize

class NN_1HL(object):
    def __init__(self, reg_lambda=0, epsilon_init=0.12, hidden_layer_size=25, opti_method='TNC', maxiter=500):
        self.reg_lambda = reg_lambda # regularization parameter (lambda) for the cost function
        self.epsilon_init = epsilon_init # range for random initialization of the thetas
        self.hidden_layer_size = hidden_layer_size # size of the hidden layer
        self.activation_func = self.sigmoid # activation function
        self.activation_func_prime = self.sigmoid_prime # derivative of the activation function
        self.method = opti_method # optimization method
        self.maxiter = maxiter # maximum number of iterations

    def sigmoid(self, z):
        ''' Logistic function.

        Returns: 
          The logistic function.

        '''
        return 1 / (1 + np.exp(-z))

    def sigmoid_prime(self, z):
        ''' Derivative of the logistic function.

        Returns: 
          The derivative of the logistic function.

        '''
        sig = self.sigmoid(z)

        return sig * (1-sig)

    def sumsqr(self, a):
        ''' Sum of squared values.

        Returns: 
          The sum of squared values.

        '''
        return np.sum(a**2)

    def rand_init(self, l_in, l_out):
        ''' Generates an (l_out x l_in+1) array of thetas (threshold 
        values), each initialized to a random number between minus 
        epsilon and epsilon.

        Note that there is one theta matrix per layer. The size of 
        each theta matrix depends on the number of activation units 
        in its corresponding layer, so each matrix may be of a 
        different size.

        Returns:
          Randomly initialized thetas (threshold values).

        '''
        return np.random.rand(l_out, l_in+1) * 2 * self.epsilon_init - self.epsilon_init

    # Pack thetas (threshold values) into a one-dimensional array.
    def pack_thetas(self, t1, t2):
        ''' Packs (unrolls) thetas (threshold values) from matrices 
        into a single vector.

        Note that there is one theta matrix per layer. To use an 
        optimization technique that minimizes the error, we need to 
        pack (unroll) the matrices into a single vector.

        Args:
          t1 (array): Unpacked (rolled) thetas (threshold values) between the input layer and hidden layer.
          t2 (array): Unpacked (rolled) thetas (threshold values) between the hidden layer and output layer.

        Returns:
          Packed (unrolled) thetas (threshold values).

        '''
        return np.concatenate((t1.reshape(-1), t2.reshape(-1)))

    def unpack_thetas(self, thetas, input_layer_size, hidden_layer_size, num_labels):
        ''' Unpacks (rolls) thetas (threshold values) from a single
        vector into a multi-dimensional array (matrices).

        Args:
          thetas (vector): Packed (unrolled) thetas (threshold values).
          input_layer_size (int): Number of nodes in the input layer.
          hidden_layer_size (int): Number of nodes in the hidden layer.
          num_labels (int): Number of classes.

        Returns:
          t1 (array): Unpacked (rolled) thetas (threshold values) between the input layer and hidden layer.
          t2 (array): Unpacked (rolled) thetas (threshold values) between the hidden layer and output layer.

        '''
        t1_start = 0
        t1_end = hidden_layer_size * (input_layer_size + 1)
        t1 = thetas[t1_start:t1_end].reshape((hidden_layer_size, input_layer_size + 1))
        t2 = thetas[t1_end:].reshape((num_labels, hidden_layer_size + 1))

        return t1, t2

    def _forward(self, X, t1, t2):
        ''' Forward propagation.

        Args:
          X ((n,m) array): The feature data for which to compute the predicted output probabilities.
          t1 (array): Unpacked (rolled) thetas (threshold values) between the input layer and hidden layer.
          t2 (array): Unpacked (rolled) thetas (threshold values) between the hidden layer and output layer.

        Returns:
          a1: The output activation of units in the input layer.
          z2: The input activation of units in the hidden layer.
          a2: The output activation of units in the hidden layer.
          z3: The input activation of units in the output layer.
          a3: The output activation of units in the output layer.

        '''
        m = X.shape[0]
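        # Bias unit: a scalar 1 for a single instance (1D input), or a column of ones for a batch of instances.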
        ones = None
        if len(X.shape) == 1:
            ones = np.array(1).reshape(1,)
        else:
            ones = np.ones(m).reshape(m,1)

        # Input layer.
        a1 = np.hstack((ones, X))

        # Hidden Layer.
        z2 = np.dot(t1, a1.T)
        a2 = self.activation_func(z2)
        a2 = np.hstack((ones, a2.T))

        # Output layer.
        z3 = np.dot(t2, a2.T)
        a3 = self.activation_func(z3)

        return a1, z2, a2, z3, a3

    def function(self, thetas, input_layer_size, hidden_layer_size, num_labels, X, y, reg_lambda):
        ''' Generates the cost using a generalization of the regularized logistic regression cost function. 

        Returns:
          J: the cost (a scalar), averaged over the training instances.

        '''
        t1, t2 = self.unpack_thetas(thetas, input_layer_size, hidden_layer_size, num_labels)

        m = X.shape[0]
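        # One-hot encode the class labels: Y has one row per instance and one column per class.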
        Y = np.eye(num_labels)[y]

        _, _, _, _, h = self._forward(X, t1, t2)
        costPositive = -Y * np.log(h).T # the cost when y is 1
        costNegative = (1 - Y) * np.log(1 - h).T # the cost when y is 0
        cost = costPositive - costNegative # the total cost
        J = np.sum(cost) / m # the (unregularized) cost function

        # For regularization.
        if reg_lambda != 0:
            t1f = t1[:, 1:]
            t2f = t2[:, 1:]
            reg = (reg_lambda / (2*m)) * (self.sumsqr(t1f) + self.sumsqr(t2f)) # regularization term
            J = J + reg # the regularized cost function

        return J

    def function_prime(self, thetas, input_layer_size, hidden_layer_size, num_labels, X, y, reg_lambda):
        ''' Computes the gradients of the cost function with respect to the thetas via backpropagation.

        Returns:
          Packed (unrolled) gradients for each theta.

        '''
        t1, t2 = self.unpack_thetas(thetas, input_layer_size, hidden_layer_size, num_labels)

        m = X.shape[0]
        t1f = t1[:, 1:] # threshold values between the input layer and hidden layer (excluding the bias input)
        t2f = t2[:, 1:] # threshold values between the hidden layer and output layer (excluding the bias input)
        Y = np.eye(num_labels)[y]

        Delta1, Delta2 = 0, 0 # initialize matrix Deltas (cost function gradients)
        # Iterate over the instances.
        for i, row in enumerate(X):
            a1, z2, a2, z3, a3 = self._forward(row, t1, t2)

            # Backprop.
            d3 = a3 - Y[i, :].T # activation error (delta) in the output layer nodes
            d2 = np.dot(t2f.T, d3) * self.activation_func_prime(z2) # activation error (delta) in the hidden layer nodes

            # Update matrix Deltas (cost function gradients).
            Delta2 += np.dot(d3[np.newaxis].T, a2[np.newaxis])
            Delta1 += np.dot(d2[np.newaxis].T, a1[np.newaxis])

        # The (unregularized) gradients for each theta.
        Theta1_grad = (1 / m) * Delta1 
        Theta2_grad = (1 / m) * Delta2

        # For regularization.
        if reg_lambda != 0:
            # The regularized gradients for each theta (excluding the bias input).
            Theta1_grad[:, 1:] = Theta1_grad[:, 1:] + (reg_lambda / m) * t1f
            Theta2_grad[:, 1:] = Theta2_grad[:, 1:] + (reg_lambda / m) * t2f

        return self.pack_thetas(Theta1_grad, Theta2_grad)

    def fit(self, X, y):
        ''' Fit the model given predictor(s) X and target y.

        Args:
          X ((n,m) array): The feature data for which to compute the predicted output.
          y ((n,1) array): The actual outputs (class data).

        '''
        input_layer_size = X.shape[1]
        num_labels = len(set(y))

        theta1_0 = self.rand_init(input_layer_size, self.hidden_layer_size)
        theta2_0 = self.rand_init(self.hidden_layer_size, num_labels)
        thetas0 = self.pack_thetas(theta1_0, theta2_0)

        options = {'maxiter': self.maxiter}
        _res = optimize.minimize(self.function, thetas0, jac=self.function_prime, method=self.method, 
                                 args=(input_layer_size, self.hidden_layer_size, num_labels, X, y, self.reg_lambda), options=options)

        self.t1, self.t2 = self.unpack_thetas(_res.x, input_layer_size, self.hidden_layer_size, num_labels)

    def predict(self, X):
        ''' Predict labels with the fitted model on predictor(s) X.

        Args:
          X ((n,m) array): The feature data for which to compute the predicted outputs.

        Returns:
          The predicted labels for each instance X.
    
        '''
        return self.predict_proba(X).argmax(0)

    def predict_proba(self, X):
        ''' Predict label probabilities with the fitted model on predictor(s) X.

        The probabilities are computed as the output activation of units in the output layer.

        Args:
          X ((n,m) array): The feature data for which to compute the predicted output probabilities.

        Returns:
          h: The predicted label probabilities for each instance X.

        '''
        _, _, _, _, h = self._forward(X, self.t1, self.t2)

        return h

Here, we demonstrate ANNs using the Iris flower dataset. We first load the data and perform some preprocessing. The preprocessing involves altering the target (class) variable, which in the Iris dataset is by default represented as strings (nominal values), but for compatibility needs to be represented as integers (numeric values). We perform this conversion using the label-encoding method available via scikit-learn.


In [2]:
import pandas as pd
from sklearn import cross_validation
from sklearn import preprocessing

label_encode = True
n_folds = 5

# Load the Iris flower dataset
fileURL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(fileURL, 
                   names=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species'],
                   header=None)
iris = iris.dropna()

X = np.array(iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]) # features
y = iris['Species'] # class

if label_encode:
    # Transform string (nominal) output to numeric
    labels = preprocessing.LabelEncoder().fit_transform(y)
else:
    labels = y

# Generate k stratified folds of the data.
skf = list(cross_validation.StratifiedKFold(labels, n_folds))

Next, we train the ANN classifier, as well as a Random Forest classifier for comparison. We use k-fold cross-validation to train and evaluate the classifiers, measuring performance with accuracy. For each classifier, the accuracy over all folds is averaged to produce the classifier's overall accuracy. We also use the timeit magic function built into IPython Notebook to track the training time for each classifier during each fold.


In [3]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Generate classification objects.
nn = NN_1HL()
rfc = RandomForestClassifier(n_estimators=50)

# Arrays to hold the accuracy scores for each classifier, one entry per fold.
scores_nn = np.zeros(n_folds) # scores for nn
scores_rfc = np.zeros(n_folds) # scores for rfc

print 'Training classifiers...'
# Iterate over the folds, each with training set and validation set indices.
for i, (train_index, test_index) in enumerate(skf):
    print '  Fold [%s]' % (i)

    # Generate the training set for the fold.
    X_train = X[train_index]
    y_train = labels[train_index]

    # Generate the testing set for the fold.
    X_test = X[test_index]
    y_test = labels[test_index]

    # Train the models on the training set.
    # We time the training using the built-in timeit magic function.
    print '    Neural Network: ',
    %timeit nn.fit(X_train, y_train)
    print '    Random Forest: ',
    %timeit rfc.fit(X_train, y_train)

    # Evaluate the models on the testing set.
    scores_nn[i] = metrics.accuracy_score(y_test, nn.predict(X_test))
    scores_rfc[i] = metrics.accuracy_score(y_test, rfc.predict(X_test))
print 'Done training classifiers.'
print

# The mean of the scores on the testing set.
print 'Artificial Neural Network Accuracy = %s' % (scores_nn.mean(axis=0))
print 'Random Forest Accuracy = %s' % (scores_rfc.mean(axis=0))


Training classifiers...
  Fold [0]
    Neural Network:  1 loops, best of 3: 2.12 s per loop
    Random Forest:  10 loops, best of 3: 46.6 ms per loop
  Fold [1]
    Neural Network:  1 loops, best of 3: 2.3 s per loop
    Random Forest:  10 loops, best of 3: 34.1 ms per loop
  Fold [2]
    Neural Network:  1 loops, best of 3: 1.82 s per loop
    Random Forest:  10 loops, best of 3: 52.8 ms per loop
  Fold [3]
    Neural Network:  1 loops, best of 3: 2.63 s per loop
    Random Forest:  10 loops, best of 3: 52 ms per loop
  Fold [4]
    Neural Network:  1 loops, best of 3: 2.3 s per loop
    Random Forest:  10 loops, best of 3: 40.1 ms per loop
Done training classifiers.

Artificial Neural Network Accuracy = 0.946666666667
Random Forest Accuracy = 0.933333333333

Notice that even for this small dataset, an artificial neural network with one hidden layer takes a comparatively long time to train. While this implementation is undoubtedly less optimized than the algorithms available via scikit-learn, artificial neural network models are known to be comparatively slow to train. At the same time, they are also known for generating comparatively accurate predictions.