In [1]:
import numpy as np
import matplotlib
The notation here follows the notation in Chapter 2 of the deep learning online book. Note that different sources use different notations, all referring to the same mathematical constructs.
The matrix $W^l$ of size $s_{l} \times s_{l-1}$, $l=2,\ldots,L$, controls the mapping from layer $l-1$ to layer $l$. The vector $\mathbf{b}^l$ of size $s_{l}$ is the bias term of layer $l$. The weight $w^l_{ij}$ is the weight associated with the connection from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.
Forward propagation: $\mathbf{a}^l = \sigma(\mathbf{z}^l)$ where $\mathbf{z}^l=W^l \mathbf{a}^{l-1}+ \mathbf{b}^{l}$, $l =2,\ldots,L$, and the activation function $\sigma \equiv \sigma_l$ is applied to each component of its argument vector. For simplicity of notation we often write $\sigma$ instead of $\sigma_l$.
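As a concrete illustration, here is a minimal sketch of a generic forward pass under the assumption that the weights and biases are stored as Python lists of numpy arrays (the function name `forward` and the choice of `np.tanh` for every layer are illustrative, not part of the notation above):

In [ ]:
import numpy as np

def forward(a, weights, biases, sigma=np.tanh):
    # weights[k] has shape (s_l, s_{l-1}) and biases[k] has shape (s_l,),
    # matching W^l and b^l above; sigma is applied componentwise
    for W, b in zip(weights, biases):
        z = W.dot(a) + b   # z^l = W^l a^{l-1} + b^l
        a = sigma(z)       # a^l = sigma_l(z^l)
    return a

# tiny example with arbitrary weights, just to show the shapes
rng = np.random.RandomState(0)
weights = [rng.randn(4, 3), rng.randn(2, 4)]   # s_1 = 3, s_2 = 4, s_3 = 2
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, 0.0, 0.0]), weights, biases))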
The activation function operates on the outcome of the dot product between the input activations and the corresponding row of the weight matrix. One way to think about this dot product is in terms of correlations, which are normalized dot products: what we really measure is the degree of correlation, or dependence, between the input vector and the coefficient vector. The output values of the next layer are computed similarly, by multiplying the resulting activations by another weight matrix.
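As a small illustration of this view, the sketch below compares a raw dot product with its normalized version (the cosine of the angle between the two vectors); the vectors are the same ones used in the activation-function example below:

In [ ]:
x = np.array([1.0, 0.0, 0.0])
w = np.array([0.2, -0.03, 0.14])
dot = x.dot(w)                                        # raw dot product
cos = dot / (np.linalg.norm(x) * np.linalg.norm(w))   # normalized dot product (correlation-like measure)
print((dot, cos))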
$\mathbf{a}^L = \sigma_L(W^L \mathbf{a}^{L-1} + \mathbf{b}^L) = \sigma_L(\mathbf{z}^L)$
When $\sigma_L=I$ (the identity activation), the output is in a form suitable for linear regression.
For classification problems we want to convert the outputs into probabilities. This is achieved by using the softmax function.
$\sigma_L(\mathbf{z}^L) = \frac{1}{\alpha}e^{\mathbf{z}^L}$ where $\alpha = \sum_k e^{z^L_k}$, which produces an output vector $\mathbf{a}^L \equiv \mathbf{y} = (y_1,\ldots,y_{s_L})$ with $y_i = \frac{e^{z^L_i}}{\sum_k e^{z^L_k}}$, $i = 1,\ldots, s_L$
The element $y_i$ is the probability that the label of the output is $i$. This is the same expression used by logistic regression for multi-class classification. The label $i^*$ that corresponds to a given input vector $\mathbf{a}^1$ is selected as the index $i^*$ for which $y_{i^*}$ is maximal.
Examples of some popular activation functions:
In [2]:
def sigmoid(x):
    return 1./(1.+np.exp(-x))

def rectifier(x):
    # ReLU: elementwise max(x, 0)
    return np.maximum(x, 0.0)

def softplus(x):
    return np.log(1.0 + np.exp(x))
x = np.array([1.0,0,0])
w = np.array([0.2,-0.03,0.14])
print ' Scalar product between unit and weights ',x.dot(w)
print ' Values of sigmoid activation function ',sigmoid(x.dot(w))
print ' Values of tanh activation function ',np.tanh(x.dot(w))
print ' Values of softplus activation function ',softplus(x.dot(w))
In [3]:
import pylab
z = np.linspace(-2,2,100) # 100 linearly spaced numbers
s = sigmoid(z)    # computing the values of the sigmoid
th = np.tanh(z)   # computing the values of tanh
re = rectifier(z) # computing the values of the rectifier
sp = softplus(z)  # computing the values of the softplus
# compose plot
pylab.plot(z,s)
pylab.plot(z,s,'co',label='Sigmoid') # Sigmoid
pylab.plot(z,th,label='tanh') # tanh
pylab.plot(z,re,label='rectifier') # rectifier
pylab.plot(z,sp,label='softplus') # softplus
pylab.legend()
pylab.show() # show the plot
In [6]:
def softmax(z):
    # normalize the exponentials so that the outputs sum to one
    alpha = np.sum(np.exp(z))
    return np.exp(z)/alpha
# Input
a0 = np.array([1.,0,0])
# First layer
W1 = np.array([[0.2,0.15,-0.01],[0.01,-0.1,-0.06],[0.14,-0.2,-0.03]])
b1 = np.array([1.,1.,1.])
z1 = W1.dot(a0) + b1
a1 = np.tanh(z1)
# Output layer
W2 = np.array([[0.08,0.11,-0.3],[0.1,-0.15,0.08],[0.1,0.1,-0.07]])
b2 = np.array([0.,1.,0.])
z2 = W2.dot(a1) + b2
a2 = y = softmax(z2)
imax = np.argmax(y)
print ' z1 ',z1
print ' a1 ',a1
print ' z2 ',z2
print ' y ',y
print ' Input vector {0} is classified to label {1} '.format(a0,imax)
print '\n'
for i in [0,1,2]:
    print 'The probability for classifying to label ',i,' is ',y[i]
Suppose that the expected output for an input vector ${\bf x} \equiv {\bf a}^1$ is ${\bf y}^* = {\bf y}^*_x = (0,1,0)$. We can now compute the error vector ${\bf e}= {\bf e}_x= {\bf a}^L_x-{\bf y}^*_x$. With this error we can compute a cost $C=C_x$ (also called a loss) associated with the output ${\bf y}_x$ of the input vector ${\bf x}$. For convenience of notation we will frequently omit the subscript $x$.
Popular loss functions include the absolute loss, the square loss, and the cross-entropy loss; all three are implemented in the cell below.
The total cost over all $N$ data vectors is computed as the average of the individual costs associated with each input vector ${\bf x}$, that is, $\frac{1}{N} \sum_x C_x$.
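A minimal sketch of this averaging, assuming the individual costs $C_x$ have already been collected in a Python list (the name `costs` and its values are purely illustrative):

In [ ]:
costs = [0.9, 1.3, 0.7, 1.1]   # hypothetical per-example costs C_x
total_cost = np.mean(costs)    # (1/N) * sum_x C_x
print(total_cost)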
In [11]:
def abs_loss(e):
    # L1 loss: sum of absolute errors
    return np.sum(np.abs(e))

def sqr_loss(e):
    # L2 loss: sum of squared errors
    return np.sum(e**2)

def cross_entropy_loss(y_estimated,y_real):
    # cross entropy between the estimated and the true label distribution
    return -np.sum(y_real*np.log(y_estimated))
y_real = np.array([0.,1.,0])
err = a2 - y_real  # error vector e = a^L - y*
print ' Error ',err
print ' Absolute loss ',abs_loss(err)
print ' Square loss ',sqr_loss(err)
print ' Cross entropy loss ',cross_entropy_loss(a2,y_real)
Backpropagation is a fast way of computing the derivatives $\frac{\partial C}{\partial w^l_{ij}}$ and $\frac{\partial C}{\partial b^l_i}$, which are needed for the stochastic gradient descent procedure used to minimize the cost function. Backpropagation is a special case of a more general technique called reverse-mode automatic differentiation. The backpropagation algorithm is a smart application of the chain rule that allows efficient calculation of the needed derivatives. A detailed discussion of the derivation of backpropagation is provided in this tutorial. An example of a simple Python implementation is provided here:
http://upul.github.io/2015/10/12/Training-(deep)-Neural-Networks-Part:-1/
Define the vector $\boldsymbol{\delta}^l= \frac{\partial C}{\partial \mathbf{z}^l}$, that is, $\delta^l_i = \frac{\partial C}{\partial z^l_i}$, $i =1,\ldots,s_l$.
Recall that $z^{l+1}_i = \sum_j w^{l+1}_{ij} a^l_j + b^{l+1}_i = \sum_j w^{l+1}_{ij} \sigma_l(z^l_j) + b^{l+1}_i$. Then we have:
$\delta^L_i = \frac{\partial C}{\partial z_i^L} = \sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_i^L} = \frac{\partial C}{\partial a_i^L} \frac{\partial a_i^L}{\partial z_i^L}= \frac{\partial C}{\partial a^L_i} \sigma'_L ( z^L_i)$, where the sum collapses because $a^L_k$ depends only on $z^L_k$ when $\sigma_L$ is applied elementwise.
$\delta^l_i = \sum_k \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_i^l} = \sum_k \delta_k^{l+1} w_{ki}^{l+1} \sigma'_l(z_i^l) = \sigma'_l(z_i^l) \cdot ((W^{l+1})^T \delta^{l+1})_i$ for $l < L$.
$\frac{\partial C}{\partial b^l_{i}} = \delta^{l}_i$
$\frac{\partial C}{\partial w^l_{ij}} = \frac{\partial C}{\partial z^{l}_{i}} \frac{\partial z^{l}_{i}}{\partial w^{l}_{ij}} = \delta^{l}_i a^{l-1}_j$
$\delta^L = \frac{\partial C}{\partial {\bf a}^L} \odot \sigma'_L ({\bf z}^L)$ where $\odot$ is the Hadamard (elementwise) product.
$\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'_l({\bf z}^l)$
$\frac{\partial C}{\partial b^l} = \delta^{l}$
$\frac{\partial C}{\partial w^l_{ij}} = \delta^{l}_i a^{l-1}_j$
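To make the four equations concrete, here is a minimal sketch of one backpropagation pass through the small network built above; it reuses `a0`, `z1`, `a1`, `W2`, `a2` and `y_real` from the earlier cells, and it uses the standard identity that for softmax combined with the cross-entropy loss the output error simplifies to $\delta^L = {\bf a}^L - {\bf y}^*$ (stated here without derivation):

In [ ]:
# output-layer error: for softmax + cross-entropy, delta^L = a^L - y*
delta2 = a2 - y_real

# propagate the error backwards: delta^l = ((W^{l+1})^T delta^{l+1}) * sigma_l'(z^l);
# the hidden layer uses tanh, whose derivative is 1 - tanh(z)^2
delta1 = W2.T.dot(delta2) * (1.0 - np.tanh(z1)**2)

# gradients of the cost with respect to the parameters:
# dC/db^l_i = delta^l_i   and   dC/dw^l_ij = delta^l_i * a^{l-1}_j
dC_db2 = delta2
dC_dW2 = np.outer(delta2, a1)
dC_db1 = delta1
dC_dW1 = np.outer(delta1, a0)

print(dC_dW1)
print(dC_dW2)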
http://karpathy.github.io/neuralnets/
Gradient descent example: http://upul.github.io/2015/10/12/Training-(deep)-Neural-Networks-Part:-1/
http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/
SGD tricks: http://research.microsoft.com/pubs/192769/tricks-2012.pdf
http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
http://code.activestate.com/recipes/578148-simple-back-propagation-neural-network-in-python-s/
http://deeplearning.net/tutorial/
http://stackoverflow.com/questions/15395835/simple-multi-layer-neural-network-implementation
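For completeness, here is a minimal sketch of a single gradient-descent update using the gradients from the backpropagation sketch above; the learning rate value is an arbitrary illustrative choice:

In [ ]:
eta = 0.1                  # illustrative learning rate
W2 = W2 - eta * dC_dW2     # w <- w - eta * dC/dw
b2 = b2 - eta * dC_db2
W1 = W1 - eta * dC_dW1
b1 = b1 - eta * dC_db1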