The Multi-Layer Perceptron


In [ ]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
  • The Perceptron is a linear model
  • Why? It is inherently simple
  • This leaves two ways to handle problems that are not linearly separable:
    • Transform the features to make the problem linearly separable
    • Make the network more complex
  • The multi-layer perceptron builds on the second idea
  • Learning with the perceptron occurs in the weights
  • To make a more complex network, add more weights
  • Two ways to do so
    • Add some backward connections - so neurons connect to inputs again
    • Add more neurons
  • The first approach leads to recurrent networks
  • The second approach leads to MLP
  • This allows us to add more "layers" of neurons between the input nodes and the outputs

Show that we can solve the XOR Problem

  • Worked in class
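
As a complement to the in-class derivation, here is a small sketch with hand-picked weights (one possible choice among many) showing that a network with one hidden layer of step-activation neurons reproduces XOR, even though XOR is not linearly separable. The thresholds are folded directly into the expressions.

In [ ]:
step = lambda z: int(z > 0)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = step(x1 + x2 - 0.5)      # hidden unit acting like OR
    h2 = step(x1 + x2 - 1.5)      # hidden unit acting like AND
    y = step(h1 - h2 - 0.5)       # output: OR and not AND, i.e. XOR
    print("%d XOR %d -> %d" % (x1, x2, y))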

Going Forward

  • Going forward means working out the outputs for given inputs
  • This is the recall phase we discussed last time
  • This is the same in the MLP as the Perceptron
  • Only we do it twice now - layer-by-layer
  • The activations of one layer of nodes are the inputs to the next

Going Backwards: Back-Propagation of Error

  • Computing the errors is the same
  • What to do with them - how to update the weights - is more difficult
  • This is called back-propagation of error
  • We do so through gradient descent
  • We know we need to update the weights, but which ones? Inputs $\rightarrow$ hidden layer or hidden layer $\rightarrow$ outputs?
  • Recall that our error function in the Perceptron was $E=t-y$
  • For the MLP we will use the familiar sum of squares error $E(t,y)=\frac{1}{2}\sum_{k=1}^n(t_k-y_k)^2$
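
To make the error concrete, here is a quick calculation of the sum-of-squares error for made-up target and output vectors.

In [ ]:
import numpy as np

t = np.array([1.0, 0.0, 1.0])   # made-up targets
y = np.array([0.8, 0.2, 0.6])   # made-up outputs
E = 0.5*np.sum((t - y)**2)
print(E)    # 0.12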

Activation Function

  • We also need to revisit our activation function
  • For the Perceptron, we used a binary activation function
  • This is not differentiable
  • For the MLP, we will use a sigmoid function
  • You have seen an example in the logistic function with the now-familiar S-shape $$g(h)=\frac{1}{1+e^{-\beta h}}$$

where $\beta > 0$


In [ ]:
fig, ax = plt.subplots()
g_func = lambda h, beta : 1/(1 + np.exp(-beta*h))
x = np.linspace(-3, 3, 100)

ax.plot(x, g_func(x, 2))
ax.hlines([0, 1], -3, 3, color="r")
ax.set_ylim(-.1, 1.1);
ax.set_title(r"$g(h;\, \beta)$");

Back-Propagation

  • We feed our inputs forward through the network, which tells us which nodes are firing
  • We compute the errors
  • We need to compute the gradient of the errors with respect to the weights
  • This allows us to update the weights in the "downhill" direction
    • Do this for the nodes connected to the output layer
    • Work backwards until we get back to the inputs again
  • Two problems
    • We don't know the inputs to the output neurons
    • We don't know the targets for the hidden neurons (for multiple hidden layers - don't know the inputs or the outputs...)
  • We can use the chain-rule from calculus to get around this

MLP Algorithm Overview

  • An input vector is put into the input nodes
  • Inputs are fed forward through the network
    • Inputs and first-layer weights $v$ determine whether the hidden nodes fire via $g(\cdot)$
    • Outputs of these nodes and the second-layer weights are used to decide if the output neurons fire
  • Sum-of-squares error is computed
  • Error is fed backwards through the network
    • Second-layer weights are updated (using $\delta_0$)
    • First-layer weights are updated (using $\delta_h$)

MLP Algorithm

  • Initialization of weights
  • Training (Repeat)

    • For each input vector

      Forwards phase

      • Compute the activation of each neuron $j$ in the hidden layer(s) using

      $$h_j=\sum_ix_iv_{ij}$$ $$a_j=g(h_j)=\frac{1}{1+e^{-\beta h_j}}$$


      • Work through the network until you get to the output layer


      $$h_k=\sum_j{a_jw_{jk}}$$ $$y_k=g(h_k)=\frac{1}{1+e^{-\beta h_k}}$$

      Backwards phase

      • compute the error at the output using

        $$\delta_{ok}=(t_k-y_k)y_k(1-y_k)$$

      • compute the error in the hidden layer(s) using

        $$\delta_{hj}=a_j(1-a_j)\sum_k w_{jk}\delta_{ok}$$

      • update the output layer weights using

        $$w_{jk}\leftarrow w_{jk} + \eta \delta_{ok}a_j$$

      • update the hidden layer weights using

        $$v_{ij}\leftarrow v_{ij}+\eta \delta_{hj}x_i$$

      Randomize the order of the input vectors so that you don't train in exactly the same order each iteration

  • Recall
    • Use the Forwards Phase
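
Before looking at the full implementation, here is a minimal sketch of one forwards/backwards pass for a single input vector, following the notation above ($v$: first-layer weights, $w$: second-layer weights, $\eta$: learning rate). The 2-3-1 network size, the input and target values, and $\beta=1$ are made up, and bias nodes are omitted to keep the sketch short.

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
eta = 0.25

x = np.array([0., 1.])                     # one input vector
t = np.array([1.])                         # its target

v = rng.uniform(-0.5, 0.5, size=(2, 3))    # input -> hidden weights
w = rng.uniform(-0.5, 0.5, size=(3, 1))    # hidden -> output weights
g = lambda h: 1/(1 + np.exp(-h))           # logistic activation, beta = 1

# Forwards phase
a = g(x.dot(v))                            # hidden activations a_j = g(h_j)
y = g(a.dot(w))                            # outputs y_k = g(h_k)

# Backwards phase
delta_o = (t - y)*y*(1 - y)                # output-layer error terms
delta_h = a*(1 - a)*w.dot(delta_o)         # hidden-layer error terms
w += eta*np.outer(a, delta_o)              # w_jk <- w_jk + eta*delta_ok*a_j
v += eta*np.outer(x, delta_h)              # v_ij <- v_ij + eta*delta_hj*x_i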

Implementation

Batch vs. Online algorithms

  • The implementation below is what's called a batch algorithm
  • This means that the weights are only updated after all the training inputs have been seen
  • The weights are updated once for each epoch (pass through training examples)
  • The gradient estimation will be more accurate and will thus converge to the local minimum more quickly
  • The algorithm described above is an online algorithm
  • It is sequential
  • An online algorithm would update the weights incrementally over the training inputs
  • Online algorithms have a few advantages
    • They can be more efficient in terms of memory use
    • They can stop "early"
    • They can avoid local minima by using less accurate gradients
  • Online algorithms are not always (easily) available

Initial weights

  • Each neuron gets an input from $n$ different places (input nodes or hidden neurons)
  • If the weights all have about the same size $w$, then the typical size of the total input to each neuron is $w\sqrt{n}$
  • A common method is then to set the initial weights so that $-\frac{1}{\sqrt{n}}<w<\frac{1}{\sqrt{n}}$
  • The weights are drawn uniformly from this range (this is what `_init_weights` in the implementation below does)
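
A small sketch of this initialization scheme: each of the hidden neurons receives $n$ inputs, so its incoming weights are drawn uniformly from $(-1/\sqrt{n}, 1/\sqrt{n})$. The sizes here are illustrative only.

In [ ]:
import numpy as np

n, nhidden = 4, 3
weights = (np.random.rand(n, nhidden) - 0.5)*2/np.sqrt(n)
print(weights)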

Using different activation functions

  • Regression problems: linear activation
$$y_k = g(h_k)=h_k$$
  • This needs a new delta term for the update step
$$\delta_{ok}=(t_k-y_k)$$
  • $1\mbox{-of-}N$ output encoding: soft-max
  • $1\mbox{-of-}N$ output encoding is used when the output variable can take on more than two values
  • For example, modeling the choice of transportation mode ("bus", "car", or "bike"), the targets might be encoded as

 

[[1, 0, 0],
 [0, 1, 0],
 [0, 0, 1]]

  • The soft-max function is the same function that appears in Multinomial Logit problems in statistics
$$y_k = g(h_k) = \frac{e^{h_k}}{\sum_{k^\prime}e^{h_{k^\prime}}}$$
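
A quick sketch of the soft-max applied to a single vector of made-up output activations $h_k$; subtracting the maximum is a common numerical-stability trick and does not change the result.

In [ ]:
import numpy as np

h = np.array([2.0, 1.0, 0.1])   # made-up output activations
e = np.exp(h - h.max())
y = e/e.sum()
print(y)                        # the entries sum to 1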

Local Minima

  • As we saw when discussing optimization, local minima can be a problem
  • This is also the case for gradient descent
  • The problem is exacerbated by the higher dimensionality
  • One way to try to overcome getting stuck in local minima is by picking up momentum
  • Momentum also allows us to use a smaller (and thus more stable) learning rate $\eta$
  • We add momentum to the weights updating as follows
$$w_{jk}^t\leftarrow w_{jk}^{t-1}+\eta \delta_0 a_j + \alpha \Delta w_{jk}^{t-1}$$

where $0 < \alpha < 1$ is the momentum constant
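
A sketch of a single momentum step for one weight, using made-up numbers; `grad_term` stands in for $\eta\delta_0 a_j$.

In [ ]:
alpha = 0.9                     # momentum constant, 0 < alpha < 1
w, prev_change = 0.1, 0.02      # current weight and its previous update
grad_term = 0.05                # stands in for eta*delta_0*a_j
change = grad_term + alpha*prev_change
w += change
prev_change = change            # saved for the next iteration
print(w)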


In [ ]:
np.set_printoptions(suppress=True)

In [ ]:
from perceptron import Perceptron, add_bias_node

Functions for internal use to avoid spaghetti code


In [ ]:
def _linear_delta(targets, outputs, nobs):
    # delta for linear output neurons (regression)
    return (targets - outputs)/nobs

def _logistic_delta(targets, outputs, *args):
    # delta for logistic output neurons: (t - y)*y*(1 - y)
    return (targets - outputs)*outputs*(1 - outputs)

def _softmax_delta(targets, outputs, nobs):
    # delta for softmax output neurons (1-of-N classification)
    return (targets - outputs)/nobs

_calc_deltao = {
        "linear" : _linear_delta,
        "logistic" : _logistic_delta,
        "softmax" : _softmax_delta
        }

def _linear_activation(outputs, *args):
    return outputs

def _logistic_activation(outputs, beta, *args):
    return 1/(1+np.exp(-beta*outputs))

def _softmax_activation(outputs, *args):
    # this is multinomial logit
    eX = np.exp(outputs)
    return eX/eX.sum(axis=1)[:,None]


_activation_funcs = {
        "linear" : _linear_activation,
        "logistic" : _logistic_activation,
        "softmax" : _softmax_activation,
        }

In [ ]:
from perceptron import Perceptron, add_bias_node

class MLP(Perceptron):
    """
    A Multi-Layer Perceptron
    """
    def __init__(self, nhidden, eta, beta=1, momentum=0.9, outtype='logistic'):
        # Set up network size
        self.nhidden = nhidden
        self.eta = eta

        self.beta = beta
        self.momentum = momentum
        self.outtype = outtype


    def _init_weights(self):
        # Initialise network
        weights1 = np.random.rand(self.m+1, self.nhidden)-0.5
        weights1 *= 2/np.sqrt(self.m)
        weights2 = np.random.rand(self.nhidden+1,self.n)-0.5
        weights2 *= 2/np.sqrt(self.nhidden)

        self.weights1 = weights1
        self.weights2 = weights2

    def earlystopping(self, inputs, targets, valid_input, valid_target,
                            max_iter=100, epsilon=1e-3, disp=True):

        self._initialize(inputs, targets)
        valid_input = add_bias_node(valid_input)

        # last_errors[0]: current iteration,
        # last_errors[1]: previous iteration,
        # last_errors[2]: two iterations ago
        last_errors = [0, np.inf, np.inf]

        count = 0

        while np.any(np.diff(last_errors) > epsilon):
            count += 1
            if disp:
                print count

            # train the network
            self.fit(inputs, targets, max_iter, init=False, disp=disp)
            last_errors[2] = last_errors[1]
            last_errors[1] = last_errors[0]

            # check on the validation set
            valid_output = self.predict(valid_input, add_bias=False)
            errors = valid_target - valid_output
            last_errors[0] = 0.5*np.sum(errors**2)
        
        if disp:
            print "Stopped in %d iterations" % count, last_errors
        return last_errors[0]

    def fit(self, inputs, targets, max_iter, disp=True, init=True):
        """
        Train the network

        Parameters
        ----------
        inputs : array-like
            The inputs data
        targets : array-like
            The targets to train on
        max_iter : int
            The number of iterations to perform
        disp : bool
            Whether to print the final objective value.
        init : bool
            Whether to initialize the weights or not.
        """
        if init:
            self._initialize(inputs, targets)
        inputs = self.inputs
        targets = self.targets
        weights1 = self.weights1
        weights2 = self.weights2
        eta = self.eta
        momentum = self.momentum
        nobs = self.nobs

        outtype = self.outtype

        # Add the inputs that match the bias node
        inputs = add_bias_node(inputs)
        change = range(self.nobs)

        updatew1 = np.zeros_like(weights1)
        updatew2 = np.zeros_like(weights2)


        for n in range(1, max_iter+1):

            # predict attaches hidden
            outputs = self.predict(inputs, add_bias=False)

            error = targets - outputs
            obj = .5 * np.sum(error**2)

            # Different types of output neurons
            deltao = _calc_deltao[outtype](targets, outputs, nobs)
            hidden = self.hidden
            deltah = hidden * (1. - hidden) * np.dot(deltao, weights2.T)

            updatew1 = (eta*(np.dot(inputs.T, deltah[:,:-1])) +
                        momentum*updatew1)
            updatew2 = (eta*(np.dot(self.hidden.T, deltao)) +
                        momentum*updatew2)
            weights1 += updatew1
            weights2 += updatew2

            # Randomise order of inputs
            np.random.shuffle(change)
            inputs = inputs[change,:]
            targets = targets[change,:]

        if disp:
            print "Iteration: ", n, " Objective: ", obj

        # attach results
        self.weights1 = weights1
        self.weights2 = weights2
        self.outputs = outputs

    def predict(self, inputs=None, add_bias=True):
        """
        Run the network forward.
        """
        if inputs is None:
            inputs = self.inputs

        if add_bias:
            inputs = add_bias_node(inputs)

        hidden = np.dot(inputs, self.weights1)
        hidden = _activation_funcs["logistic"](hidden, self.beta)
        hidden = add_bias_node(hidden)
        self.hidden = hidden

        outputs = np.dot(self.hidden, self.weights2)

        outtype = self.outtype

        # Different types of output neurons
        return _activation_funcs[self.outtype](outputs, self.beta)

    def confusion_matrix(self, inputs, targets, summary=True):
        """
        Confusion matrix.
        """
        targets = np.asarray(targets)

        # Add the inputs that match the bias node
        inputs = add_bias_node(inputs)
        # predict attaches hidden
        outputs = self.predict(inputs, add_bias=False)

        n_classes = targets.ndim == 1 and 1 or targets.shape[1]

        if n_classes==1:
            n_classes = 2
            # 50% cut-off with continuous activation function
            outputs = np.where(outputs > 0.5, 1, 0)
        else:
            # 1-of-N encoding
            outputs = np.argmax(outputs, 1)
            targets = np.argmax(targets, 1)

        outputs = np.squeeze(outputs)
        targets = np.squeeze(targets)

        cm = np.histogram2d(targets, outputs, bins=n_classes)[0]

        if not summary:
            return cm
        else:
            return np.trace(cm)/np.sum(cm)*100

AND Example


In [ ]:
X = [[0., 0.],
     [0., 1.],
     [1., 0.],
     [1., 1.]]

In [ ]:
target = [0, 0, 0, 1]

In [ ]:
and_clf = MLP(2, .25)
and_clf.fit(X, target, 5001)

In [ ]:
and_clf.predict(X)

In [ ]:
and_clf.confusion_matrix(X, target, summary=False)

XOR Example


In [ ]:
target = [0., 1., 1., 0.]

In [ ]:
xor_clf = MLP(2, .25)
xor_clf.fit(X, target, 5001)

In [ ]:
xor_clf.predict()

Practical Considerations

  • Data Preparation
  • How much training data?
  • Number of hidden layers?
  • Overfitting
    • Stop before overfitting
    • How do we figure this out?

Training, Testing, and Validation

  • To evaluate our training, we may use the holdout method
  • We set aside some data from the training set
  • This data is called the test set
  • It reduces the data available for training
  • We also want to try to evaluate how well the learning is going during training
  • For this we use a validation set (the same idea as cross-validation in statistics)
  • We might split the data into training:test:validation according to 50:25:25 or 60:20:20
  • This is called a three-way data split
  • You may need to be sure that the data is randomly ordered before splitting
  • All data preprocessing occurs before splitting (see the sketch below)
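
A minimal sketch of one way to make such a split by shuffling indices first; the 60:20:20 proportions, the number of observations, and the seed are illustrative only.

In [ ]:
import numpy as np

n = 100
rng = np.random.RandomState(0)
idx = rng.permutation(n)                 # randomize the order first
train_idx = idx[:int(.6*n)]
test_idx = idx[int(.6*n):int(.8*n)]
validate_idx = idx[int(.8*n):]
print("%d %d %d" % (len(train_idx), len(test_idx), len(validate_idx)))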

Other Resampling Methods

  • It is common to have too little data for a proper training-testing-validation split
  • We might also want to avoid getting a misleadingly small error rate because of a "lucky" random split
  • At the expense of more computations, we may try cross-validation
    • Random Subsampling
    • K-Fold Cross-Validation
    • Leave-one-out Cross Validation
  • The idea is to randomly partition the dataset into $K$ subsets
  • Use one of the $K$ subsets as a validation set and train on the others
  • Do this again, leaving out another subset for validation, until all are left-out for validation
  • Use the model with the smallest validation error

When to Stop?

  • So far we have trained a network for a fixed number of iterations
  • This is never what you really want to do
  • You run the risk of both over- and under-fitting the data.
  • We can use the validation set to determine when to stop
  • Train the network for some fixed number of iterations and then evaluate it on the validation set
  • Train for a few more iterations, evaluate again, and repeat
  • At some point the error on the validation set will start to increase
  • This is when we stop. Unsurprisingly, this is called early stopping

Example: Regression

  • First let's generate some data
  • Then we'll use the MLP to try to uncover the function (or data generating process)

In [ ]:
np.random.seed(12345)
x = np.linspace(0, 1, 40)[:,None] # make 2D
t = (np.sin(2*np.pi*x) + 
     np.cos(4*np.pi*x) + 
     np.random.normal(0, .2, size=(40,1)))
x = (x - .5)*2

In [ ]:
fig, ax = plt.subplots()
ax.plot(x, t, 'o');

Split into 50:25:25


In [ ]:
train = x[::2]
test = x[1::4]
validate = x[3::4]

train_target = t[::2]
test_target = t[1::4]
validate_target = t[3::4]
  • We don't know how many hidden neurons we'll need, so let's try 3.
  • We will run this for 100 iterations

In [ ]:
net = MLP(3, .25, outtype="linear")
net.fit(train, train_target, 100)
  • First let's decide how long to train the network

In [ ]:
net.earlystopping(train, train_target, validate, validate_target)
  • Now we need to figure out how to select the number of hidden nodes we want
  • To find out, we can run each network size a number of times, say 10, and keep the error

In [ ]:
nhiddens = [1, 2, 3, 5, 10, 25, 50, 100]
all_errors = []
nruns = 10

for nhidden in nhiddens:
    errors = []
    mlp = MLP(nhidden, .25, outtype="linear")
    for i in range(nruns):
        error = mlp.earlystopping(train, train_target, validate, validate_target, disp=False)
        errors.append(error)
    all_errors.append(errors)

In [ ]:
all_errors = np.array(all_errors)
print "        n              mean          std            min            max"
print np.column_stack((nhiddens, all_errors.mean(1), all_errors.std(1),
                       all_errors.min(1), all_errors.max(1)))

scikit-learn for training, testing, and cross-validation


In [ ]:
from sklearn import cross_validation

a, b = np.arange(16).reshape((8, 2)), np.arange(8)

In [ ]:
print a
print b

In [ ]:
(a_train, 
 a_test, 
 b_train, 
 b_test) = cross_validation.train_test_split(a, b, 
                                     train_size=.75,
                                     random_state=123)

In [ ]:
a_train, b_train

In [ ]:
a_test, b_test

Aside: Python Generators

K-Folds Cross-Validation

  • The data is split into $K$ consecutive "folds"
  • Of the $K$ subsamples, retain a single subsample as a validation set
  • Use the other $K-1$ subsamples as a training set
  • Gives indices to split the data into train and test sets
  • All observations are used as both training and test samples

In [ ]:
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
kf = cross_validation.KFold(len(y), n_folds=5)

for train_index, test_index in kf:
    print "TRAIN:", train_index, "TEST:", test_index

In [ ]:
kf = cross_validation.KFold(10, n_folds=5, shuffle=True)

for train_index, test_index in kf:
    print "TRAIN:", train_index, "TEST:", test_index

Example: Classification


In [ ]:
from sklearn.datasets import load_iris

data = load_iris()

In [ ]:
print data.DESCR

You can read much more about the dataset here

Normalize the data


In [ ]:
# demean
X = data.data - data.data.mean(0)
# normalize by maximum
X /= X.max(0)

In [ ]:
data.target

We need to put this into 1-of-N encoding


In [ ]:
target = np.zeros((len(X), 3))
target[np.arange(len(X)),data.target] = 1

In [ ]:
target[:10]

We are going to use train_test_split to split the data up into training, testing, and validation sets


In [ ]:
(feature_train, feature_test,
 target_train, target_test) = cross_validation.train_test_split(X, target, test_size=.25)

In [ ]:
target_train.mean(0)

In [ ]:
target_test.mean(0)

If you need to split based on maintaining the same percentage of the target labels, you can use StratifiedKFold

We can now split the training data into training and validation sets


In [ ]:
(feature_train, feature_validate,
 target_train, target_validate) = cross_validation.train_test_split(feature_train, 
                                                                    target_train, 
                                                                    test_size=.33)

In [ ]:
net = MLP(5, .1, outtype="softmax")
net.earlystopping(feature_train, target_train, feature_validate, 
                  target_validate, disp=False)

In [ ]:
net.confusion_matrix(feature_test, target_test)

In [ ]:
net.confusion_matrix(feature_test, target_test, summary=False)

Back-Propagation Derivation

  • The output, $y$, is a function of $x$, $g(\cdot)$, and the weights
  • The weights will be denoted $v$ and $w$ for the first and second layers
  • $i$ is the index over the input nodes, $j$ is the index over the hidden layer neurons, $k$ is the index over the output neurons
  • First let's write the error function
$$\begin{aligned}E(\boldsymbol{w}) & =\frac{1}{2}\sum_{k=1}^N\left(t_k-y_k\right)^2 \cr & = \frac{1}{2}\sum_k{\left[t_k-g\left(\sum_j w_{jk}a_j \right)\right]^2}\end{aligned}$$

where $a_j$ is the output from the hidden layer neurons

  • For the moment, let's ignore the hidden layer and work with the perceptron
$$\begin{aligned}E(\boldsymbol{w}) & =\frac{1}{2}\sum_{k=1}^N\left(t_k-y_k\right)^2 \cr & = \frac{1}{2}\sum_k{\left[t_k-g\left(\sum_j w_{jk}x_j \right)\right]^2}\end{aligned}$$
  • Since we will use gradient descent, unsurprisingly, we will need the gradient
  • Let's remind ourselves, what is the gradient again?
  • Recall that $g$ was the binary activation function and, thus, not differentiable, so we ignore it in the below
  • We adjust the weights to reduce the errors, thus we need
$$\begin{aligned}\frac{\partial E}{\partial w_{ik}} & = \frac{\partial}{\partial w_{ik}}\left(\frac{1}{2}\sum_k\left(t_k-\sum_j w_{jk}x_j\right)^2\right) \cr & = \frac{1}{2}\sum_k2(t_k-y_k)\frac{\partial}{\partial w_{ik}}\left(t_k-\sum_j w_{jk}x_j\right) \cr \end{aligned} $$

note that

$$\frac{\partial t_k}{\partial w_{ik}}=0$$

and

$$\frac{\partial}{\partial w_{ik}}\sum_jw_{jk}x_j$$

is only non-zero when $i=j$, thus

$$\frac{\partial}{\partial w_{ik}}\sum_jw_{jk}x_j=x_i$$

so that we have

$$\begin{aligned}\frac{\partial E}{\partial w_{ik}} & = \sum_k(t_k-y_k)\left(-x_i\right) \end{aligned} $$

To make our errors smaller, we follow the gradient "downhill" such that (including the learning rate)

$$w_{ik}\leftarrow w_{ik}+\eta(t_k-y_k)x_i$$

Working with the Activation Function

  • From the above, it is clear that we need a differentiable $g$ to obtain $\frac{\partial g}{\partial w_{ik}}$ $$a = g(h)=\frac{1}{1+e^{-\beta h}}$$
  • The derivative of $g$ has a simple form $$\begin{aligned} g^{\prime}(h) & = \frac{d}{dh}\frac{1}{1+e^{-\beta h}} \cr & = \frac{d}{dh}(1+e^{-\beta h})^{-1} \cr & = -(1+e^{-\beta h})^{-2}(-\beta e^{-\beta h}) \cr & = \frac{\beta e^{-\beta h}}{(1+e^{-\beta h})^{2}} \cr & = \beta\frac{1}{1+e^{-\beta h}}\frac{ e^{-\beta h}}{1+e^{-\beta h}} \cr & = \beta\frac{1}{1+e^{-\beta h}}\left(\frac{1 + e^{-\beta h} - 1}{1+e^{-\beta h}}\right) \cr & = \beta g(h)(1-g(h)) \cr & = \beta a(1-a) \cr \end{aligned} $$
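
As a quick sanity check on this derivative, the sketch below compares $\beta g(h)(1-g(h))$ with a central finite-difference approximation at a few arbitrary points; $\beta$ and the test points are made up.

In [ ]:
import numpy as np

beta = 2.0
g = lambda h: 1/(1 + np.exp(-beta*h))

h = np.array([-1.5, 0.0, 0.7])
eps = 1e-6
numeric = (g(h + eps) - g(h - eps))/(2*eps)   # central difference
analytic = beta*g(h)*(1 - g(h))
print(np.allclose(numeric, analytic))         # True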

Back-Propagation of Error

  • To use gradient-descent in the MLP, we need the partials of the errors with respect to each weight
$$\frac{\partial E}{\partial w_{jk}}=\frac{\partial E}{\partial h_k}\frac{\partial h_k}{\partial w_{jk}}$$

where $h_k=\sum_lw_{lk}a_l$ is the input to output-layer neuron $k$

  • I.e., this is the weighted sum of the activations of the hidden layer neurons, using the second-layer weights
  • Taking the second factor first, we have
$$\begin{aligned} \frac{\partial h_k}{\partial w_{jk}} & =\frac{\partial \sum_l w_{lk}a_l}{\partial w_{jk}} \cr & = \sum_l\frac{\partial w_{lk}a_l}{\partial w_{jk}} \cr & = a_j \end{aligned} $$
  • The first term is referred to as the error or delta term
$$\delta_0 = \frac{\partial E}{\partial h_k}$$
  • We need to unpack this, because we don't know the inputs, just the outputs
$$\delta_0 = \frac{\partial E}{\partial h_k}= \frac{\partial E}{\partial y_k}\frac{\partial y_k}{\partial h_k}$$

where the output of the output-layer neuron $k$

$$y_k = g(h_k)=g\left(\sum_j w_{jk}a_j\right)$$

Plugging-in the derivatives we already have for $\delta_0$ gives

$$\begin{aligned} \delta_0 & = \frac{\partial E}{\partial g(h_k)}\frac{\partial g(h_k)}{\partial h_k} \cr & = \frac{\partial}{\partial g(h_k)}\left[\frac{1}{2}\sum_k{\left(t_k-g\left(\sum_j w_{jk}a_j\right)\right)^2}\right] \frac{\partial g(h_k)}{\partial h_k} \cr & = (g(h_k)-t_k)g^{\prime}(h_k) \cr & = (y_k - t_k)g^{\prime}(h_k) \end{aligned} $$

We already have $g^\prime(h_k)$ so we can put it all together to give the update step for the second-layer weights

$$w_{jk}\leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}$$

The missing piece is

$$\begin{aligned} \frac{\partial E}{\partial w_{jk}} &= \delta_0a_j \cr &= (y_k - t_k)y_k(1-y_k)a_j \end{aligned}$$

First layer weights

  • Now we need the first-layer weights - the hidden-layer weights $v_{ij}$
  • Remember that we are propagating the error backwards through the network
$$\begin{aligned} \delta_h &= \sum_k \frac{\partial E}{\partial h_k} \frac{\partial h_k}{\partial h_j} \cr &= \sum_k\delta_0 \frac{\partial h_k}{\partial h_j} \end{aligned}$$
  • One thing to keep in mind: the input to each output-layer neuron is a weighted sum of the activations of the hidden-layer neurons, using the second-layer weights
$$h_k = \sum_l w_{lk}a_l = \sum_l w_{lk}g(h_l)$$

and

$$\frac{\partial h_k}{\partial h_j}=\frac{\partial \sum_l w_{lk}g(h_l)}{\partial h_j}$$

Noting that $\frac{\partial g(h_l)}{\partial h_j} = 0$ if $l\neq j$

$$\begin{aligned} \frac{\partial h_k}{\partial h_j}&=w_{jk}g^{\prime}(h_j) \cr &=w_{jk}a_j(1-a_j) \end{aligned}$$

This gives a delta term

$$\delta_h = a_j(1-a_j)\sum_k \delta_0 w_{jk}$$

So the update rule $v_{ij}\leftarrow v_{ij} - \eta\frac{\partial E}{\partial v_{ij}}$ needs

$$\frac{\partial E}{\partial v_{ij}} = a_j(1-a_j)(\sum_k \delta_0w_{jk})x_i$$
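
As a final sanity check on the derivation, the sketch below compares the analytic gradient $\frac{\partial E}{\partial w_{jk}}=\delta_0a_j$ with a finite-difference approximation of $E$ for the second-layer weights. The network sizes, the random data, and $\beta=1$ are arbitrary, and bias nodes are omitted.

In [ ]:
import numpy as np

rng = np.random.RandomState(1)
x = rng.rand(5, 2)                         # 5 inputs with 2 features
t = rng.rand(5, 1)                         # 5 targets
v = rng.uniform(-0.5, 0.5, size=(2, 3))    # first-layer weights
w = rng.uniform(-0.5, 0.5, size=(3, 1))    # second-layer weights
g = lambda h: 1/(1 + np.exp(-h))

def error(w):
    a = g(x.dot(v))
    y = g(a.dot(w))
    return 0.5*np.sum((t - y)**2)

# analytic gradient, summed over the training examples
a = g(x.dot(v))
y = g(a.dot(w))
delta_o = (y - t)*y*(1 - y)
analytic = a.T.dot(delta_o)

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j, k] += eps
        w_minus[j, k] -= eps
        numeric[j, k] = (error(w_plus) - error(w_minus))/(2*eps)

print(np.allclose(analytic, numeric))      # True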