The Multi-Layer Perceptron


In [ ]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
  • The Perceptron is a linear model
  • Why? It is inherently simple
  • This leaves two ways to handle problems that are not linearly separable:
    • Transform the features to make the problem linearly separable
    • Make the network more complex
  • The multi-layer perceptron builds on the second idea
  • Learning with the perceptron occurs in the weights
  • To make a more complex network, add more weights
  • Two ways to do so
    • Add some backward connections - so neurons connect to inputs again
    • Add more neurons
  • The first approach leads to recurrent networks
  • The second approach leads to MLP
  • This allows us to add more "layers" of neurons between the input nodes and the outputs

Show that we can solve the XOR Problem

  • Worked in class
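
As a complement to the in-class derivation, here is a small sketch with hand-picked weights (one possible choice among many) showing that a network with one hidden layer of step-activation neurons reproduces XOR, even though XOR is not linearly separable. The thresholds are folded directly into the expressions.

In [ ]:
step = lambda z: int(z > 0)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = step(x1 + x2 - 0.5)      # hidden unit acting like OR
    h2 = step(x1 + x2 - 1.5)      # hidden unit acting like AND
    y = step(h1 - h2 - 0.5)       # output: OR and not AND, i.e. XOR
    print("%d XOR %d -> %d" % (x1, x2, y))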

Going Forward

  • Going forward means working out the outputs for given inputs
  • This is the recall phase we discussed last time
  • This is the same in the MLP as the Perceptron
  • Only we do it twice now - layer-by-layer
  • The activations of one layer of nodes are the inputs to the next

Going Backwards: Back-Propagation of Error

  • Computing the errors is the same
  • What to do with them - how to update the weights - is more difficult
  • This is called back-propagation of error
  • We do so through gradient descent
  • We know we need to update the weights, but which ones? Inputs $\rightarrow$ hidden layer or hidden layer $\rightarrow$ outputs?
  • Recall that our error function in the Perceptron was $E=t-y$
  • For the MLP we will use the familiar sum of squares error $E(t,y)=\frac{1}{2}\sum_{k=1}^n(t_k-y_k)^2$
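
To make the error concrete, here is a quick calculation of the sum-of-squares error for made-up target and output vectors.

In [ ]:
import numpy as np

t = np.array([1.0, 0.0, 1.0])   # made-up targets
y = np.array([0.8, 0.2, 0.6])   # made-up outputs
E = 0.5*np.sum((t - y)**2)
print(E)    # 0.12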

Activation Function

  • We also need to revisit our activation function
  • For the Perceptron, we used a binary activation function
  • This is not differentiable
  • For the MLP, we will use a sigmoid function
  • You have seen an example in the logistic function with the now-familiar S-shape $$g(h)=\frac{1}{1+e^{-\beta h}}$$

where $\beta > 0$


In [ ]:
fig, ax = plt.subplots()
g_func = lambda h, beta : 1/(1 + np.exp(-beta*h))
x = np.linspace(-3, 3, 100)

ax.plot(x, g_func(x, 2))
ax.hlines([0, 1], -3, 3, color="r")
ax.set_ylim(-.1, 1.1);
ax.set_title(r"$g(h;\, \beta)$");

Back-Propagation

  • We feed our inputs forward through the network, which tells us which nodes are firing
  • We compute the errors
  • We need to compute the gradient of the errors with respect to the weights
  • This allows us to update the weights in the "downhill" direction
    • Do this for the nodes connected to the output layer
    • Work backwards until we get back to the inputs again
  • Two problems
    • We don't know the inputs to the output neurons
    • We don't know the targets for the hidden neurons (for multiple hidden layers - don't know the inputs or the outputs...)
  • We can use the chain-rule from calculus to get around this

MLP Algorithm Overview

  • An input vector is put into the input nodes
  • Inputs are fed forward through the network
    • Inputs and first-layer weights $v$ determine whether the hidden nodes fire via $g(\cdot)$
    • Outputs of these nodes and the second-layer weights are used to decide if the output neurons fire
  • Sum-of-squares error is computed
  • Error is fed backwards through the network
    • Second-layer weights are updated (using $\delta_0$)
    • First-layer weights are updated (using $\delta_h$)

MLP Algorithm

  • Initialization of weights
  • Training (Repeat)

    • For each input vector

      Forwards phase

      • Compute the activation of each neuron $j$ in the hidden layer(s) using

      $$h_j=\sum_ix_iv_{ij}$$ $$a_j=g(h_j)=\frac{1}{1+e^{-\beta h_j}}$$


      • Work through the network until you get to the output layer


      $$h_k=\sum_j{a_jw_{jk}}$$ $$y_k=g(h_k)=\frac{1}{1+e^{-\beta h_k}}$$

      Backwards phase

      • compute the error at the output using

        $$\delta_{ok}=(t_k-y_k)y_k(1-y_k)$$

      • compute the error in the hidden layer(s) using

        $$\delta_{hj}=a_j(1-a_j)\sum_k w_{jk}\delta_{ok}$$

      • update the output layer weights using

        $$w_{jk}\leftarrow w_{jk} + \eta \delta_{ok}a_j$$

      • update the hidden layer weights using

        $$v_{ij}\leftarrow v_{ij}+\eta \delta_{hj}x_i$$

      Randomize the order of the input vectors so that you don't train in exactly the same order each iteration

  • Recall
    • Use the Forwards Phase
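
Before looking at the full implementation, here is a minimal sketch of one forwards/backwards pass for a single input vector, following the notation above ($v$: first-layer weights, $w$: second-layer weights, $\eta$: learning rate). The 2-3-1 network size, the input and target values, and $\beta=1$ are made up, and bias nodes are omitted to keep the sketch short.

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
eta = 0.25

x = np.array([0., 1.])                     # one input vector
t = np.array([1.])                         # its target

v = rng.uniform(-0.5, 0.5, size=(2, 3))    # input -> hidden weights
w = rng.uniform(-0.5, 0.5, size=(3, 1))    # hidden -> output weights
g = lambda h: 1/(1 + np.exp(-h))           # logistic activation, beta = 1

# Forwards phase
a = g(x.dot(v))                            # hidden activations a_j = g(h_j)
y = g(a.dot(w))                            # outputs y_k = g(h_k)

# Backwards phase
delta_o = (t - y)*y*(1 - y)                # output-layer error terms
delta_h = a*(1 - a)*w.dot(delta_o)         # hidden-layer error terms
w += eta*np.outer(a, delta_o)              # w_jk <- w_jk + eta*delta_ok*a_j
v += eta*np.outer(x, delta_h)              # v_ij <- v_ij + eta*delta_hj*x_i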

Implementation

Batch vs. Online algorithms

  • The implementation below is what's called a batch algorithm
  • This means that the weights are only updated after all the training inputs have been seen
  • The weights are updated once for each epoch (pass through training examples)
  • The gradient estimation will be more accurate and will thus converge to the local minimum more quickly
  • The algorithm described above is an online algorithm
  • It is sequential
  • An online algorithm would update the weights incrementally over the training inputs
  • Online algorithms have a few advantages
    • They can be more efficient in terms of memory use
    • They can stop "early"
    • They can avoid local minima by using less accurate gradients
  • Online algorithms are not always (easily) available

Initial weights

  • Each neuron gets an input from $n$ different places (input nodes or hidden neurons)
  • If the weights all have about the same size $w$, then the typical size of the total input to each neuron is $w\sqrt{n}$
  • A common method is then to set the initial weights so that $-\frac{1}{\sqrt{n}}<w<\frac{1}{\sqrt{n}}$
  • The weights are drawn uniformly from this range (this is what `_init_weights` in the implementation below does)
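
A small sketch of this initialization scheme: each of the hidden neurons receives $n$ inputs, so its incoming weights are drawn uniformly from $(-1/\sqrt{n}, 1/\sqrt{n})$. The sizes here are illustrative only.

In [ ]:
import numpy as np

n, nhidden = 4, 3
weights = (np.random.rand(n, nhidden) - 0.5)*2/np.sqrt(n)
print(weights)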

Using different activation functions

  • Regression problems: linear activation
$$y_k = g(h_k)=h_k$$
  • This needs a new delta term for the update step
$$\delta_{ok}=(t_k-y_k)$$
  • $1\mbox{-of-}N$ output encoding: soft-max
  • $1\mbox{-of-}N$ output encoding is used when the output variable can take on more than two values
  • For example, modeling the choice of transportation mode ("bus", "car", or "bike"), the targets might be encoded as

 

[[1, 0, 0],
 [0, 1, 0],
 [0, 0, 1]]

  • The soft-max function is the same function that appears in Multinomial Logit problems in statistics
$$y_k = g(h_k) = \frac{e^{h_k}}{\sum_{k^\prime}e^{h_{k^\prime}}}$$
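
A quick sketch of the soft-max applied to a single vector of made-up output activations $h_k$; subtracting the maximum is a common numerical-stability trick and does not change the result.

In [ ]:
import numpy as np

h = np.array([2.0, 1.0, 0.1])   # made-up output activations
e = np.exp(h - h.max())
y = e/e.sum()
print(y)                        # the entries sum to 1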

Local Minima

  • As we saw when discussing optimization, local minima can be a problem
  • This is also the case for gradient descent
  • The problem is exacerbated by the higher dimensionality
  • One way to try to overcome getting stuck in local minima is by picking up momentum
  • Momentum also allows us to use a smaller (and thus more stable) learning rate $\eta$
  • We add momentum to the weights updating as follows
$$w_{jk}^t\leftarrow w_{jk}^{t-1}+\eta \delta_0 a_j + \alpha \Delta w_{jk}^{t-1}$$

where $0 < \alpha < 1$ is the momentum constant
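
A sketch of a single momentum step for one weight, using made-up numbers; `grad_term` stands in for $\eta\delta_0 a_j$.

In [ ]:
alpha = 0.9                     # momentum constant, 0 < alpha < 1
w, prev_change = 0.1, 0.02      # current weight and its previous update
grad_term = 0.05                # stands in for eta*delta_0*a_j
change = grad_term + alpha*prev_change
w += change
prev_change = change            # saved for the next iteration
print(w)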


In [ ]:
np.set_printoptions(suppress=True)

In [ ]:
from perceptron import Perceptron, add_bias_node

Functions for internal use to avoid spaghetti code


In [ ]:
def _linear_delta(targets, outputs, nobs):
    # delta for linear output neurons (regression)
    return (targets - outputs)/nobs

def _logistic_delta(targets, outputs, *args):
    # delta for logistic output neurons: (t - y)*y*(1 - y)
    return (targets - outputs)*outputs*(1 - outputs)

def _softmax_delta(targets, outputs, nobs):
    # delta for softmax output neurons (1-of-N classification)
    return (targets - outputs)/nobs

_calc_deltao = {
        "linear" : _linear_delta,
        "logistic" : _logistic_delta,
        "softmax" : _softmax_delta
        }

def _linear_activation(outputs, *args):
    return outputs

def _logistic_activation(outputs, beta, *args):
    return 1/(1+np.exp(-beta*outputs))

def _softmax_activation(outputs, *args):
    # this is multinomial logit
    eX = np.exp(outputs)
    return eX/eX.sum(axis=1)[:,None]


_activation_funcs = {
        "linear" : _linear_activation,
        "logistic" : _logistic_activation,
        "softmax" : _softmax_activation,
        }

In [ ]:
from perceptron import Perceptron, add_bias_node

class MLP(Perceptron):
    """
    A Multi-Layer Perceptron
    """
    def __init__(self, nhidden, eta, beta=1, momentum=0.9, outtype='logistic'):
        # Set up network size
        self.nhidden = nhidden
        self.eta = eta

        self.beta = beta
        self.momentum = momentum
        self.outtype = outtype


    def _init_weights(self):
        # Initialise network
        weights1 = np.random.rand(self.m+1, self.nhidden)-0.5
        weights1 *= 2/np.sqrt(self.m)
        weights2 = np.random.rand(self.nhidden+1,self.n)-0.5
        weights2 *= 2/np.sqrt(self.nhidden)

        self.weights1 = weights1
        self.weights2 = weights2

    def earlystopping(self, inputs, targets, valid_input, valid_target,
                            max_iter=100, epsilon=1e-3, disp=True):

        self._initialize(inputs, targets)
        valid_input = add_bias_node(valid_input)

        # last_errors[0]: current iteration,
        # last_errors[1]: previous iteration,
        # last_errors[2]: two iterations ago
        last_errors = [0, np.inf, np.inf]

        count = 0

        while np.any(np.diff(last_errors) > epsilon):
            count += 1
            if disp:
                print count

            # train the network
            self.fit(inputs, targets, max_iter, init=False, disp=disp)
            last_errors[2] = last_errors[1]
            last_errors[1] = last_errors[0]

            # check on the validation set
            valid_output = self.predict(valid_input, add_bias=False)
            errors = valid_target - valid_output
            last_errors[0] = 0.5*np.sum(errors**2)
        
        if disp:
            print "Stopped in %d iterations" % count, last_errors
        return last_errors[0]

    def fit(self, inputs, targets, max_iter, disp=True, init=True):
        """
        Train the network

        Parameters
        ----------
        inputs : array-like
            The inputs data
        targets : array-like
            The targets to train on
        max_iter : int
            The number of iterations to perform
        disp : bool
            Whether to print the final objective value.
        init : bool
            Whether to initialize the weights or not.
        """
        if init:
            self._initialize(inputs, targets)
        inputs = self.inputs
        targets = self.targets
        weights1 = self.weights1
        weights2 = self.weights2
        eta = self.eta
        momentum = self.momentum
        nobs = self.nobs

        outtype = self.outtype

        # Add the inputs that match the bias node
        inputs = add_bias_node(inputs)
        change = range(self.nobs)

        updatew1 = np.zeros_like(weights1)
        updatew2 = np.zeros_like(weights2)


        for n in range(1, max_iter+1):

            # predict attaches hidden
            outputs = self.predict(inputs, add_bias=False)

            error = targets - outputs
            obj = .5 * np.sum(error**2)

            # Different types of output neurons
            deltao = _calc_deltao[outtype](targets, outputs, nobs)
            hidden = self.hidden
            deltah = hidden * (1. - hidden) * np.dot(deltao, weights2.T)

            updatew1 = (eta*(np.dot(inputs.T, deltah[:,:-1])) +
                        momentum*updatew1)
            updatew2 = (eta*(np.dot(self.hidden.T, deltao)) +
                        momentum*updatew2)
            weights1 += updatew1
            weights2 += updatew2

            # Randomise order of inputs
            np.random.shuffle(change)
            inputs = inputs[change,:]
            targets = targets[change,:]

        if disp:
            print "Iteration: ", n, " Objective: ", obj

        # attach results
        self.weights1 = weights1
        self.weights2 = weights2
        self.outputs = outputs

    def predict(self, inputs=None, add_bias=True):
        """
        Run the network forward.
        """
        if inputs is None:
            inputs = self.inputs

        if add_bias:
            inputs = add_bias_node(inputs)

        hidden = np.dot(inputs, self.weights1)
        hidden = _activation_funcs["logistic"](hidden, self.beta)
        hidden = add_bias_node(hidden)
        self.hidden = hidden

        outputs = np.dot(self.hidden, self.weights2)

        outtype = self.outtype

        # Different types of output neurons
        return _activation_funcs[self.outtype](outputs, self.beta)

    def confusion_matrix(self, inputs, targets, summary=True):
        """
        Confusion matrix.
        """
        targets = np.asarray(targets)

        # Add the inputs that match the bias node
        inputs = add_bias_node(inputs)
        # predict attaches hidden
        outputs = self.predict(inputs, add_bias=False)

        n_classes = targets.ndim == 1 and 1 or targets.shape[1]

        if n_classes==1:
            n_classes = 2
            # 50% cut-off with continuous activation function
            outputs = np.where(outputs > 0.5, 1, 0)
        else:
            # 1-of-N encoding
            outputs = np.argmax(outputs, 1)
            targets = np.argmax(targets, 1)

        outputs = np.squeeze(outputs)
        targets = np.squeeze(targets)

        cm = np.histogram2d(targets, outputs, bins=n_classes)[0]

        if not summary:
            return cm
        else:
            return np.trace(cm)/np.sum(cm)*100

AND Example


In [ ]:
X = [[0., 0.],
     [0., 1.],
     [1., 0.],
     [1., 1.]]

In [ ]:
target = [0, 0, 0, 1]

In [ ]:
and_clf = MLP(2, .25)
and_clf.fit(X, target, 5001)

In [ ]:
and_clf.predict(X)

In [ ]:
and_clf.confusion_matrix(X, target, summary=False)

XOR Example


In [ ]:
target = [0., 1., 1., 0.]

In [ ]:
xor_clf = MLP(2, .25)
xor_clf.fit(X, target, 5001)

In [ ]:
xor_clf.predict()

Practical Considerations

  • Data Preparation
  • How much training data?
  • Number of hidden layers?
  • Overfitting
    • Stop before overfitting
    • How do we figure this out?

Training, Testing, and Validation

  • To evaluate our training, we may use the holdout method
  • We set aside some data from the training set
  • This data is called the test set
  • It reduces the data available for training
  • We also want to try to evaluate how well the learning is going during training
  • For this we use a validation set (the same idea as cross-validation in statistics)
  • We might split the data into training:test:validation according to 50:25:25 or 60:20:20
  • This is called a three-way data split
  • You may need to be sure that the data is randomly ordered before splitting
  • All data preprocessing occurs before splitting (see the sketch below)
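
A minimal sketch of one way to make such a split by shuffling indices first; the 60:20:20 proportions, the number of observations, and the seed are illustrative only.

In [ ]:
import numpy as np

n = 100
rng = np.random.RandomState(0)
idx = rng.permutation(n)                 # randomize the order first
train_idx = idx[:int(.6*n)]
test_idx = idx[int(.6*n):int(.8*n)]
validate_idx = idx[int(.8*n):]
print("%d %d %d" % (len(train_idx), len(test_idx), len(validate_idx)))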

Other Resampling Methods

  • It is common to have too little data for a proper training-testing-validation split
  • We might also want to avoid getting a misleadingly small error rate because of a "lucky" random split
  • At the expense of more computations, we may try cross-validation
    • Random Subsampling
    • K-Fold Cross-Validation
    • Leave-one-out Cross Validation
  • The idea is to randomly partition the dataset into $K$ subsets
  • Use one of the $K$ subsets as a validation set and train on the others
  • Do this again, leaving out another subset for validation, until all are left-out for validation
  • Use the model with the smallest validation error

When to Stop?

  • So far we have trained a network for a fixed number of iterations
  • This is never what you really want to do
  • You run the risk of both over- and under-fitting the data.
  • We can use the validation set to determine when to stop
  • Train the network for some fixed number of iterations and then evaluate it on the validation set
  • Train for a few more iterations, evaluate again, and repeat
  • At some point the error on the validation set will start to increase
  • This is when we stop. Unsurprisingly, this is called early stopping

Example: Regression

  • First let's generate some data
  • Then we'll use the MLP to try to uncover the function (or data generating process)

In [ ]:
np.random.seed(12345)
x = np.linspace(0, 1, 40)[:,None] # make 2D
t = (np.sin(2*np.pi*x) + 
     np.cos(4*np.pi*x) + 
     np.random.normal(0, .2, size=(40,1)))
x = (x - .5)*2

In [ ]:
fig, ax = plt.subplots()
ax.plot(x, t, 'o');

Split into 50:25:25


In [ ]:
train = x[::2]
test = x[1::4]
validate = x[3::4]

train_target = t[::2]
test_target = t[1::4]
validate_target = t[3::4]
  • We don't know how many hidden neurons we'll need, so let's try 3.
  • We will run this for 100 iterations

In [ ]:
net = MLP(3, .25, outtype="linear")
net.fit(train, train_target, 100)
  • First let's decide how long to train the network

In [ ]:
net.earlystopping(train, train_target, validate, validate_target)
  • Now we need to figure out how to select the number of hidden nodes we want
  • To find out, we can run each network size a number of times, say 10, and keep the error

In [ ]:
nhiddens = [1, 2, 3, 5, 10, 25, 50, 100]
all_errors = []
nruns = 10

for nhidden in nhiddens:
    errors = []
    mlp = MLP(nhidden, .25, outtype="linear")
    for i in range(nruns):
        error = mlp.earlystopping(train, train_target, validate, validate_target, disp=False)
        errors.append(error)
    all_errors.append(errors)

In [ ]:
all_errors = np.array(all_errors)
print "        n              mean          std            min            max"
print np.column_stack((nhiddens, all_errors.mean(1), all_errors.std(1),
                       all_errors.min(1), all_errors.max(1)))

scikit-learn for training, testing, and cross-validation


In [ ]:
from sklearn import cross_validation

a, b = np.arange(16).reshape((8, 2)), np.arange(8)

In [ ]:
print a
print b

In [ ]:
(a_train, 
 a_test, 
 b_train, 
 b_test) = cross_validation.train_test_split(a, b, 
                                     train_size=.75,
                                     random_state=123)

In [ ]:
a_train, b_train

In [ ]:
a_test, b_test

Aside: Python Generators

K-Folds Cross-Validation

  • The data is split into $K$ consecutive "folds"
  • Of the $K$ subsamples, retain a single subsample as a validation set
  • Use the other $K-1$ subsamples as a training set
  • Gives indices to split the data into train and test sets
  • All observations are used as both training and test samples

In [ ]:
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
kf = cross_validation.KFold(len(y), n_folds=5)

for train_index, test_index in kf:
    print "TRAIN:", train_index, "TEST:", test_index

In [ ]:
kf = cross_validation.KFold(10, n_folds=5, shuffle=True)

for train_index, test_index in kf:
    print "TRAIN:", train_index, "TEST:", test_index

Example: Classification


In [ ]:
from sklearn.datasets import load_iris

data = load_iris()

In [ ]:
print data.DESCR

You can read much more about the dataset here

Normalize the data


In [ ]:
# demean
X = data.data - data.data.mean(0)
# normalize by maximum
X /= X.max(0)

In [ ]:
data.target

We need to put this into 1-of-N encoding


In [ ]:
target = np.zeros((len(X), 3))
target[np.arange(len(X)),data.target] = 1

In [ ]:
target[:10]

We are going to use train_test_split to split the data up into training, testing, and validation sets


In [ ]:
(feature_train, feature_test,
 target_train, target_test) = cross_validation.train_test_split(X, target, test_size=.25)

In [ ]:
target_train.mean(0)

In [ ]:
target_test.mean(0)

If you need to split based on maintaining the same percentage of the target labels, you can use StratifiedKFold

We can now split the training data into training and validation sets


In [ ]:
(feature_train, feature_validate,
 target_train, target_validate) = cross_validation.train_test_split(feature_train, 
                                                                    target_train, 
                                                                    test_size=.33)

In [ ]:
net = MLP(5, .1, outtype="softmax")
net.earlystopping(feature_train, target_train, feature_validate, 
                  target_validate, disp=False)

In [ ]:
net.confusion_matrix(feature_test, target_test)

In [ ]:
net.confusion_matrix(feature_test, target_test, summary=False)

Back-Propagation Derivation

  • The output, $y$, is a function of $x$, $g(\cdot)$, and the weights
  • The weights will be denoted $v$ and $w$ for the first and second layers
  • $i$ is the index over the input nodes, $j$ is the index over the hidden layer neurons, $k$ is the index over the output neurons
  • First let's write the error function
$$\begin{aligned}E(\boldsymbol{w}) & =\frac{1}{2}\sum_{k=1}^N\left(t_k-y_k\right)^2 \cr & = \frac{1}{2}\sum_k{\left[t_k-g\left(\sum_j w_{jk}a_j \right)\right]^2}\end{aligned}$$

where $a_j$ is the output from the hidden layer neurons

  • For the moment, let's ignore the hidden layer and work with the perceptron
$$\begin{aligned}E(\boldsymbol{w}) & =\frac{1}{2}\sum_{k=1}^N\left(t_k-y_k\right)^2 \cr & = \frac{1}{2}\sum_k{\left[t_k-g\left(\sum_j w_{jk}x_j \right)\right]^2}\end{aligned}$$
  • Since we will use gradient descent, unsurprisingly, we will need the gradient
  • Let's remind ourselves, what is the gradient again?
  • Recall that $g$ was the binary activation function and, thus, not differentiable, so we ignore it in the below
  • We adjust the weights to reduce the errors, thus we need
$$\begin{aligned}\frac{\partial E}{\partial w_{ik}} & = \frac{\partial}{\partial w_{ik}}\left(\frac{1}{2}\sum_k\left(t_k-\sum_j w_{jk}x_j\right)^2\right) \cr & = \frac{1}{2}\sum_k2(t_k-y_k)\frac{\partial}{\partial w_{ik}}\left(t_k-\sum_j w_{jk}x_j\right) \cr \end{aligned} $$

note that

$$\frac{\partial t_k}{\partial w_{ik}}=0$$

and

$$\frac{\partial}{\partial w_{ik}}\sum_jw_{jk}x_j$$

is only non-zero when $i=j$, thus

$$\frac{\partial}{\partial w_{ik}}\sum_jw_{jk}x_j=x_i$$

so that we have

$$\begin{aligned}\frac{\partial E}{\partial w_{ik}} & = \sum_k(t_k-y_k)\left(-x_i\right) \end{aligned} $$

To make our errors smaller, we follow the gradient "downhill" such that (including the learning rate)

$$w_{ik}\leftarrow w_{ik}+\eta(t_k-y_k)x_i$$

Working with the Activation Function

  • From the above, it is clear that we need a differentiable $g$ to obtain $\frac{\partial g}{\partial w_{ik}}$ $$a = g(h)=\frac{1}{1+e^{-\beta h}}$$
  • The derivative of $g$ has a simple form $$\begin{aligned} g^{\prime}(h) & = \frac{d}{dh}\frac{1}{1+e^{-\beta h}} \cr & = \frac{d}{dh}(1+e^{-\beta h})^{-1} \cr & = -(1+e^{-\beta h})^{-2}(-\beta e^{-\beta h}) \cr & = \frac{\beta e^{-\beta h}}{(1+e^{-\beta h})^{2}} \cr & = \beta\frac{1}{1+e^{-\beta h}}\frac{ e^{-\beta h}}{1+e^{-\beta h}} \cr & = \beta\frac{1}{1+e^{-\beta h}}\left(\frac{1 + e^{-\beta h} - 1}{1+e^{-\beta h}}\right) \cr & = \beta g(h)(1-g(h)) \cr & = \beta a(1-a) \cr \end{aligned} $$
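
As a quick sanity check on this derivative, the sketch below compares $\beta g(h)(1-g(h))$ with a central finite-difference approximation at a few arbitrary points; $\beta$ and the test points are made up.

In [ ]:
import numpy as np

beta = 2.0
g = lambda h: 1/(1 + np.exp(-beta*h))

h = np.array([-1.5, 0.0, 0.7])
eps = 1e-6
numeric = (g(h + eps) - g(h - eps))/(2*eps)   # central difference
analytic = beta*g(h)*(1 - g(h))
print(np.allclose(numeric, analytic))         # True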

Back-Propagation of Error

  • To use gradient-descent in the MLP, we need the partials of the errors with respect to each weight
$$\frac{\partial E}{\partial w_{jk}}=\frac{\partial E}{\partial h_k}\frac{\partial h_k}{\partial w_{jk}}$$

where $h_k=\sum_lw_{lk}a_l$ is the input to output-layer neuron $k$

  • I.e., this is the weighted sum of the activations of the hidden layer neurons, using the second-layer weights
  • Taking the second factor first, we have
$$\begin{aligned} \frac{\partial h_k}{\partial w_{jk}} & =\frac{\partial \sum_l w_{lk}a_l}{\partial w_{jk}} \cr & = \sum_l\frac{\partial w_{lk}a_l}{\partial w_{jk}} \cr & = a_j \end{aligned} $$
  • The first term is referred to as the error or delta term
$$\delta_0 = \frac{\partial E}{\partial h_k}$$
  • We need to unpack this, because we don't know the inputs, just the outputs
$$\delta_0 = \frac{\partial E}{\partial h_k}= \frac{\partial E}{\partial y_k}\frac{\partial y_k}{\partial h_k}$$

where the output of the output-layer neuron $k$

$$y_k = g(h_k)=g\left(\sum_j w_{jk}a_j\right)$$

Plugging-in the derivatives we already have for $\delta_0$ gives

$$\begin{aligned} \delta_0 & = \frac{\partial E}{\partial g(h_k)}\frac{\partial g(h_k)}{\partial h_k} \cr & = \frac{\partial}{\partial g(h_k)}\left[\frac{1}{2}\sum_k{\left(t_k-g\left(\sum_j w_{jk}a_j\right)\right)^2}\right] \frac{\partial g(h_k)}{\partial h_k} \cr & = (g(h_k)-t_k)g^{\prime}(h_k) \cr & = (y_k - t_k)g^{\prime}(h_k) \end{aligned} $$

We already have $g^\prime(h_k)$ so we can put it all together to give the update step for the second-layer weights

$$w_{jk}\leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}$$

The missing piece is

$$\begin{aligned} \frac{\partial E}{\partial w_{jk}} &= \delta_0a_j \cr &= (y_k - t_k)y_k(1-y_k)a_j \end{aligned}$$

First layer weights

  • Now we need the first-layer weights - the hidden-layer weights $v_{ij}$
  • Remember that we are propagating the error backwards through the network
$$\begin{aligned} \delta_h &= \sum_k \frac{\partial E}{\partial h_k} \frac{\partial h_k}{\partial h_j} \cr &= \sum_k\delta_0 \frac{\partial h_k}{\partial h_j} \end{aligned}$$
  • One thing to keep in mind: the input to each output-layer neuron is a weighted sum of the activations of the hidden-layer neurons, using the second-layer weights
$$h_k = \sum_l w_{lk}a_l = \sum_l w_{lk}g(h_l)$$

and

$$\frac{\partial h_k}{\partial h_j}=\frac{\partial \sum_l w_{lk}g(h_l)}{\partial h_j}$$

Noting that $\frac{\partial g(h_l)}{\partial h_j} = 0$ if $l\neq j$

$$\begin{aligned} \frac{\partial h_k}{\partial h_j}&=w_{jk}g^{\prime}(h_j) \cr &=w_{jk}a_j(1-a_j) \end{aligned}$$

This gives a delta term

$$\delta_h = a_j(1-a_j)\sum_k \delta_0 w_{jk}$$

So the update rule $v_{ij}\leftarrow v_{ij} - \eta\frac{\partial E}{\partial v_{ij}}$ needs

$$\frac{\partial E}{\partial v_{ij}} = a_j(1-a_j)(\sum_k \delta_0w_{jk})x_i$$
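
As a final sanity check on the derivation, the sketch below compares the analytic gradient $\frac{\partial E}{\partial w_{jk}}=\delta_0a_j$ with a finite-difference approximation of $E$ for the second-layer weights. The network sizes, the random data, and $\beta=1$ are arbitrary, and bias nodes are omitted.

In [ ]:
import numpy as np

rng = np.random.RandomState(1)
x = rng.rand(5, 2)                         # 5 inputs with 2 features
t = rng.rand(5, 1)                         # 5 targets
v = rng.uniform(-0.5, 0.5, size=(2, 3))    # first-layer weights
w = rng.uniform(-0.5, 0.5, size=(3, 1))    # second-layer weights
g = lambda h: 1/(1 + np.exp(-h))

def error(w):
    a = g(x.dot(v))
    y = g(a.dot(w))
    return 0.5*np.sum((t - y)**2)

# analytic gradient, summed over the training examples
a = g(x.dot(v))
y = g(a.dot(w))
delta_o = (y - t)*y*(1 - y)
analytic = a.T.dot(delta_o)

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j, k] += eps
        w_minus[j, k] -= eps
        numeric[j, k] = (error(w_plus) - error(w_minus))/(2*eps)

print(np.allclose(analytic, numeric))      # True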