In [1]:
import numpy as np
import matplotlib

General terminology and notations

The notation here follows Chapter 2 of the online deep learning book. Note that different sources use different notations, all of which refer to the same mathematical constructs.

  • We consider $L$ layers marked by $l=1,\ldots,L$ where $l=1$ denotes the input layer
  • Layer $l$ has $s_l$ units referred to by $a^{l}_j, j = 1,\ldots,s_l$.
  • The matrix $W^l$ of size $s_{l} \times s_{l-1}$, $l=2,\ldots,L$, controls the mapping from layer $l-1$ to layer $l$. The vector $\mathbf{b}^l$ of size $s_{l}$ is the bias term of layer $l$. The weight $w^l_{ij}$ is associated with the connection from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.

  • Forward propagation: $\mathbf{a}^l = \sigma(\mathbf{z}^l)$ where $\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^{l}$, $l = 2,\ldots,L$, and the activation function $\sigma \equiv \sigma_l$ is applied to each component of its argument vector. For simplicity of notation we often write $\sigma$ instead of $\sigma_l$. A shape-checking sketch of a single forward step is given below.
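As a sanity check on these conventions, here is a minimal NumPy sketch of one forward step; the layer sizes, random values, and choice of tanh are illustrative and not taken from the text above.

import numpy as np

s_prev, s_cur = 4, 3                    # illustrative sizes s_{l-1} and s_l
a_prev = np.random.rand(s_prev)         # a^{l-1}: activations of the previous layer
W      = np.random.rand(s_cur, s_prev)  # W^l has shape s_l x s_{l-1}
b      = np.random.rand(s_cur)          # b^l has s_l components
z      = W.dot(a_prev) + b              # z^l = W^l a^{l-1} + b^l
a      = np.tanh(z)                     # a^l = sigma_l(z^l), here sigma_l = tanh
print(' shapes: W {0}, z {1}, a {2}'.format(W.shape, z.shape, a.shape))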

Synonyms

  • Neuron - inspired from biology analogy
  • Unit - It’s one component of a large network
  • Feature - It implements a feature detector that looks at the input and turns on iff the sought feature is present in the input

A note about dot product

The activation function works on the outcome of a dot product. One way to think about this is in terms of correlations, which are normalized dot products. Thus what we really measure is the degree of correlation, or dependence, between the input vector and the coefficient vector. We can view the dot product as:

  • A correlation filter - fires if the correlation between input and weights exceeds a threshold
  • A feature detector - detects whether a specific pattern occurs in the input (see the sketch below)
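The sketch below illustrates the correlation view: it normalizes the dot product of the input and weight vectors (cosine similarity) and compares it against a threshold. The vectors and the threshold are made up for illustration.

import numpy as np

def correlation(x, w):
    # normalized dot product: 1 means perfectly aligned, 0 means uncorrelated
    return x.dot(w) / (np.linalg.norm(x) * np.linalg.norm(w))

x         = np.array([1.0, 0.2, 0.1])   # input vector (illustrative)
w         = np.array([0.9, 0.3, 0.0])   # weights, i.e. the sought pattern (illustrative)
threshold = 0.8                         # arbitrary firing threshold

c = correlation(x, w)
print(' correlation {0:.3f}, unit fires: {1}'.format(c, c > threshold))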

Output unit

The output values are computed similarly, by multiplying the values of the last hidden layer $\mathbf{a}^{L-1}$ by another weight matrix,


$\mathbf{a}^L = \sigma_L(W^L \mathbf{a}^{L-1} + \mathbf{b}^L) = \sigma_L(\mathbf{z}^L)$

Linear regression network

Defined when $\sigma_L = I$, the identity function. In that case the output is in a form suitable for linear regression; a small sketch follows.
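A minimal sketch of such an output layer, with made-up hidden activations and weights (one output unit, identity activation):

import numpy as np

a_hidden = np.tanh(np.array([0.5, -0.2, 0.1]))   # some hidden-layer activations (illustrative)
W_out    = np.array([[0.3, -0.1, 0.2]])          # W^L for a single output unit (illustrative)
b_out    = np.array([0.05])                      # b^L
a_out    = W_out.dot(a_hidden) + b_out           # sigma_L = identity, so a^L = z^L
print(' linear (regression-style) output: {0}'.format(a_out))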

Softmax function

For classification problems we want to convert the output values into probabilities. This is achieved by using the softmax function.

$\sigma_L(\mathbf{z}^L) = \frac{1}{\alpha}e^{\mathbf{z}^L}$ where $\alpha = \sum_j e^{z^L_j}$, which produces an output vector $\mathbf{a}^L \equiv \mathbf{y} = (y_1,\ldots,y_{s_L})$ with $y_i = \frac{e^{z^L_i}}{\sum_j e^{z^L_j}}, i = 1,\ldots, s_L$.

The element $y_i$ is the probability that the label of the output is $i$. This is the same expression used by logistic regression for classification with many labels (multinomial logistic regression). The label $i^*$ that corresponds to a given input vector $\mathbf{a}^1$ is selected as the index for which $y_i$ is maximal.

Examples of some popular activation functions:

  • Sigmoid: Transforms the inner product into an S-shaped curve. There are several popular choices of sigmoid activation function:
    • The logistic function: $\sigma(z) = \frac{1}{1+ e^{-z}}$ has values in $[0,1]$ and thus can be interpreted as a probability.
    • Hyperbolic tangent: $\sigma(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ with values in $(-1,1)$
  • Rectifier: $\sigma(z) = \max(0,z)$. A unit that uses a rectifier function is called a rectified linear unit (ReLU).
  • Softplus: $\sigma(z) = \ln (1+e^z)$ is a smooth approximation to the rectifier function.

Synonyms for the term "unit activation"

  • Unit's value: View it as a function of the input
  • Activation: Emphasizes that the unit may be responding or not, or to an extent; it’s most appropriate for logistic units
  • Output

Python example : some activation functions


In [2]:
def sigmoid(x):
    return 1./(1.+np.exp(-x))

def rectifier(x):
    return np.maximum(x, 0.0)   # elementwise max(0, x)

def softplus(x):
    return np.log(1.0 + np.exp(x))

x     = np.array([1.0,0,0])
w     = np.array([0.2,-0.03,0.14])
print ' Scalar product between unit and weights ',x.dot(w)
print ' Values of Sigmoid activation function   ',sigmoid(x.dot(w))
print ' Values of tanh    activation function   ',np.tanh(x.dot(w))
print ' Values of softplus activation function  ',softplus(x.dot(w))


 Scalar product between unit and weights  0.2
 Values of Sigmoid activation function    0.549833997312
 Values of tanh    activation function    0.197375320225
 Values of softplus activation function   0.798138869382

In [3]:
import pylab

z  = np.linspace(-2,2,100) # 100 linearly spaced numbers
s  = sigmoid(z)    # computing the values of sigmoid
th = np.tanh(z)    # computing the values of tanh
re = rectifier(z)  # computing the values of rectifier
sp = softplus(z)   # computing the values of softplus

# compose plot
pylab.plot(z,s) 
pylab.plot(z,s,'co',label='Sigmoid')   # Sigmoid 
pylab.plot(z,th,label='tanh')          # tanh
pylab.plot(z,re,label='rectifier')     # rectifier
pylab.plot(z,sp,label='softplus')      # softplus
pylab.legend()
pylab.show() # show the plot

Python example : Simple feed forward classification NN


In [6]:
def softmax(z):
    alpha = np.sum(np.exp(z))
    return np.exp(z)/alpha

# Input
a0  = np.array([1.,0,0])

# First layer
W1   = np.array([[0.2,0.15,-0.01],[0.01,-0.1,-0.06],[0.14,-0.2,-0.03]])
b1   = np.array([1.,1.,1.])
z1   = W1.dot(a0) + b1
a1   = np.tanh(z1)

# Output layer
W2   = np.array([[0.08,0.11,-0.3],[0.1,-0.15,0.08],[0.1,0.1,-0.07]])
b2   = np.array([0.,1.,0.])
z2   = W2.dot(a1) + b2
a2 = y  = softmax(z2)
imax = np.argmax(y)

print ' z1 ',z1
print ' a1 ',np.tanh(z1)
print ' z2 ',z2
print ' y  ',y
print ' Input vector {0} is classified to label {1} '.format(a0,imax)


print '\n'
for i in [0,1,2]:
    print 'The probability for classifying to label ',i,' is ',y[i]


 z1  [ 1.2   1.01  1.14]
 a1  [ 0.83365461  0.76576202  0.81441409]
 z2  [-0.09339804  1.03365429  0.10293268]
 y   [ 0.18855564  0.58198547  0.22945889]
 Input vector [ 1.  0.  0.] is classified to label 1 


The probability for classifying to label  0  is  0.188555644067
The probability for classifying to label  1  is  0.581985469162
The probability for classifying to label  2  is  0.229458886771

Cost (or error) functions

Suppose that the expected output for an input vector ${\bf x} \equiv {\bf a}^1$ is ${\bf y}^* = {\bf y}_x^* = (0,1,0)$. We can then compute the error vector ${\bf e} = {\bf e}_x = {\bf a}_x^L - {\bf y}_x^*$. With this error, we can compute a cost (also called loss) $C = C_x$ associated with the output ${\bf y}_x$ of the input vector ${\bf x}$. For convenience of notation we will frequently omit the subscript $x$.

Popular loss functions are:

  • Absolute cost $C = C({\bf a}^L)=\sum_i |e_i|$
  • Square cost $C= C({\bf a}^L) = \sum_i e_i^2$
  • Cross entropy loss $C=C({\bf a}^L) = -\sum_i y_i^*\log{a^L_i} \equiv -\sum_i y_i^*\log{y_i}$. The rationale here is that the output of the softmax function is a probability distribution, and we can also view the real label vector ${\bf y}^*$ as a probability distribution (1 for the correct label and 0 for all other labels). The cross entropy function is a common way to measure the difference between distributions.

The total cost over all $N$ data vectors is computed as the average of the individual costs associated with each input vector ${\bf x}$, that is: $\frac{1}{N} \sum_x C_x$. A small batch-averaging sketch follows the next cell.


In [11]:
def abs_loss(e):
    return np.sum(np.abs(e))

def sqr_loss(e):
    return np.sum(e**2)

def cross_entropy_loss(y_estimated,y_real):
    return -np.sum(y_real*np.log(y_estimated))

y_real = np.array([0.,1.,0])
err    = a2 - y_real

print ' Error                                          ',err
print ' Absolute loss                                  ',abs_loss(err)
print ' Square loss                                    ',sqr_loss(err)
print ' Cross entropy loss                             ',cross_entropy_loss(a2,y_real)


 Error                                           [ 0.18855564 -0.41801453  0.22945889]
 Absolute loss                                   0.836029061676
 Square loss                                     0.262940759619
 Cross entropy loss                              0.541309798638
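To illustrate the averaging over the data set, the sketch below computes the mean cross-entropy cost over a small hypothetical batch of softmax outputs and one-hot labels; it reuses the cross_entropy_loss function from the previous cell, and all numbers are made up.

# A hypothetical batch of N = 3 softmax outputs and their one-hot labels
outputs = np.array([[0.19, 0.58, 0.23],
                    [0.70, 0.20, 0.10],
                    [0.25, 0.25, 0.50]])
labels  = np.array([[0., 1., 0.],
                    [1., 0., 0.],
                    [0., 0., 1.]])

per_example_costs = [cross_entropy_loss(o, t) for o, t in zip(outputs, labels)]
total_cost        = np.mean(per_example_costs)   # (1/N) * sum_x C_x
print(' per-example costs {0}'.format(np.round(per_example_costs, 3)))
print(' average cost      {0:.3f}'.format(total_cost))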

Backpropagation

Backpropagation is a fast way of computing the derivatives $\frac{\partial C}{\partial w^l_{ij}}$ and $\frac{\partial C}{\partial b^l_i}$, which are needed for the stochastic gradient descent procedure used to minimize the cost function. Backpropagation is a special case of a more general technique called reverse-mode automatic differentiation. The backpropagation algorithm is a smart application of the chain rule that allows efficient calculation of the needed derivatives. A detailed discussion of the derivation of backpropagation is provided in this tutorial. An example of a simple Python implementation is provided here:

http://upul.github.io/2015/10/12/Training-(deep)-Neural-Networks-Part:-1/

Derivation of backpropagation

Define the vector ${\mathbf \delta}^l = \frac{\partial C}{\partial {\bf z}^l}$, that is, $\delta^l_i = \frac{\partial C}{\partial z^l_i}, i = 1,\ldots,s_l$.

Recall that $z^{l+1}_i = \sum_j w^{l+1}_{ij} a^l_j + b^{l+1}_i = \sum_j w^{l+1}_{ij} \sigma_l(z^l_j) + b^{l+1}_i$. Then we have:

  • $\delta^L_i = \frac{\partial C}{\partial z_i^L} = \sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_i^L} = \frac{\partial C}{\partial a_i^L} \frac{\partial a_i^L}{\partial z_i^L} = \frac{\partial C}{\partial a^L_i} \sigma'_L ( z^L_i)$, since for an elementwise $\sigma_L$ only the $k=i$ term survives.

  • $\delta^l_i = \frac{\partial C}{\partial z_i^l} = \sum_k \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_i^{l}} = \sum_k \delta_k^{l+1} w_{ki}^{l+1} \sigma'_l(z_i^l) = \sigma'_l(z_i^l) \cdot ((W^{l+1})^T \delta^{l+1})_i$ for $l < L$.

  • $\frac{\partial C}{\partial b^l_{i}} = \delta^{l}_i$

  • $\frac{\partial C}{\partial w^l_{ij}} = \frac{\partial C}{\partial z^{l}_{i}} \frac{\partial z^{l}_{i}}{\partial w^{l}_{ij}} = \delta^{l}_i a^{l-1}_j$

In vector form (implemented in the sketch after this list):

  • $\delta^L = \frac{\partial C}{\partial {\bf a}^L} \odot \sigma'_L ({\bf z}^L)$ where $\odot$ is the Hadamard (elementwise) product.

  • $\delta^l = (W^{l+1})^T \delta^{l+1} \odot \sigma'_l({\bf z}^l)$

  • $\frac{\partial C}{\partial {\bf b}^l} = \delta^{l}$

  • $\frac{\partial C}{\partial w^l_{ij}} = \delta^{l}_i a^{l-1}_j$
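The following is a minimal sketch of these four equations for a tiny network with one tanh hidden layer, a logistic (sigmoid) output layer, and the square cost $C = \sum_i e_i^2$; the activations are elementwise, as assumed in the derivation, and all sizes and weight values are illustrative only.

import numpy as np

def sigmoid(x):
    return 1./(1.+np.exp(-x))

# Forward pass (l = 1 is the input layer)
a1 = np.array([1., 0., 0.])                        # input a^1
W2 = np.array([[0.2, 0.15, -0.01],
               [0.01, -0.1, -0.06]])               # W^2 has shape s_2 x s_1 = 2 x 3
b2 = np.array([1., 1.])
z2 = W2.dot(a1) + b2
a2 = np.tanh(z2)                                   # sigma_2 = tanh

W3 = np.array([[0.08, 0.11],
               [0.1, -0.15]])                      # W^3 has shape s_3 x s_2 = 2 x 2
b3 = np.array([0., 1.])
z3 = W3.dot(a2) + b3
a3 = sigmoid(z3)                                   # sigma_3 = logistic

ystar  = np.array([0., 1.])                        # desired output y*
dC_da3 = 2.*(a3 - ystar)                           # dC/da^L for the square cost

# Backward pass: the four backpropagation equations
delta3  = dC_da3 * a3*(1. - a3)                    # delta^L = dC/da^L (.) sigma'_L(z^L)
delta2  = W3.T.dot(delta3) * (1. - np.tanh(z2)**2) # delta^l = (W^{l+1})^T delta^{l+1} (.) sigma'_l(z^l)
grad_b3 = delta3                                   # dC/db^l = delta^l
grad_b2 = delta2
grad_W3 = np.outer(delta3, a2)                     # dC/dw^l_{ij} = delta^l_i a^{l-1}_j
grad_W2 = np.outer(delta2, a1)

print(' delta3 {0}'.format(delta3))
print(' delta2 {0}'.format(delta2))
print(' dC/dW3\n{0}'.format(grad_W3))
print(' dC/dW2\n{0}'.format(grad_W2))

A quick check is to perturb a single weight, recompute $C$, and compare the finite-difference change with the corresponding entry of the gradient above.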