In [1]:
import pandas as pd
import numpy as np
from random import randint
from pandas import Series,DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import math
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

In [2]:
%%html
<style>
table {float:left}
</style>


Using a handwritten-digit dataset in the spirit of MNIST.

Source: http://yann.lecun.com/exdb/mnist/

sklearn ships a smaller 8x8 digits dataset (load_digits) and has nicely imported it for us.

| Property | Value |
|---|---|
| Classes | 10 |
| Samples per class | ~180 |
| Samples total | 1797 |
| Dimensionality | 64 |
| Features | integers 0-16 |

In [3]:
from sklearn.datasets import load_digits
digits = load_digits(n_class=10)
print type(digits)

import random
digits_sample = random.sample(range(0,digits.images.shape[0]),10)
print digits_sample

#show sample digits
plt.rcParams['figure.figsize'] = (12, 4)
f, axarr = plt.subplots(2, 5)
for j in range(0,axarr.shape[1]):
    for i in range(0,axarr.shape[0]):
        axarr[i,j].matshow(digits.images[digits_sample[i*axarr.shape[1]+j]]) #index so all 10 sampled digits are shown
plt.show()


<class 'sklearn.datasets.base.Bunch'>
[176, 1368, 568, 1241, 1600, 1161, 264, 434, 945, 717]

Multi-Layer Feed-Forward Neural Network

(in this case, 3 layers [input, 1 hidden, output])

Source: https://en.wikipedia.org/wiki/Feedforward_neural_network

Source2: http://neuralnetworksanddeeplearning.com/chap1.html

The structure is something like the below.

In our case, each image has a total of 8x8 = 64 pixels, so we will use 64 input neurons instead of 784. We can try the same 15 hidden neurons as in the reference, plus 10 output neurons (one per digit).

(Important note: since every node in the same layer receives exactly the same input, it is essential to initialize all the weights with different (randomized) values! Otherwise the nodes stay identical and their outputs will be the same for all eternity -- see the quick illustration below.)
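As a quick standalone illustration (made-up numbers, not the actual initialization used further down), identical starting weights make every hidden neuron compute exactly the same output, so gradient updates can never pull them apart:

```python
import numpy as np

x = np.random.rand(64)                   # one flattened 8x8 image (made-up values)
w_same = np.full((15, 64), 0.1)          # every hidden neuron starts with identical weights
w_rand = np.random.rand(15, 64) - 0.5    # randomized initialization

h_same = 1.0/(1.0 + np.exp(-w_same.dot(x)))
h_rand = 1.0/(1.0 + np.exp(-w_rand.dot(x)))

print(np.unique(h_same).size)   # 1  -> all 15 hidden neurons compute exactly the same output
print(np.unique(h_rand).size)   # 15 -> the neurons differ and can learn different features
```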

Let's define some terms.

| symbol | meaning |
|---|---|
| $i$ | index of inputs to a neuron |
| $k$ | index of the neurons (per layer) |
| $j$ | index of data input rows (each digit image) |
| $w$ | weight vector for neuron connections |
| $w_{i,k}$ | weight of the connection from input $i$ to neuron $k$ |
| $x$ or $X$ | vector or array of input data |
| $y$ or $Y$ | vector or array of target data (0 or 1 for each digit) |
| $f$ or $f(x,..)$ | activation function, in this case a sigmoid |
| $\eta$ | learning rate (pronounced "eta") |

The node's activation function

This is 10-digit multi-class classification, where the probability-like output of each node is a sigmoid function:

$$ f(w,X) = \dfrac{1}{1+e^{-(w \cdot X)}} $$

(Note: the weight vector $w$ includes the bias component as $w_0$. The input array $X$ always has 1 as its first element, so the bias is always activated.)
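A tiny worked example of this activation, with made-up weights and three inputs plus the bias:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])      # w[0] is the bias weight w_0 (hypothetical values)
x = np.array([1.0, 0.3, 0.7])       # first element is the constant 1 that activates the bias

z = w.dot(x)                        # w.X = w_0*1 + w_1*x_1 + w_2*x_2 = 1.6
f = 1.0/(1.0 + np.exp(-z))          # sigmoid squashes it into (0, 1)
print(z, f)                         # 1.6, ~0.832
```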

Loss function

The loss function is the mean squared error (MSE):

$$ L = \dfrac{1}{2m}\sum_{j=1}^m \bigl\|\;f(w,x_j) - y_j\;\bigr\|^2 $$
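A quick NumPy check of this loss on a tiny made-up batch (m = 2 samples, 3 output nodes; the loss_func defined further down implements exactly this):

```python
import numpy as np

Y = np.array([[1.0, 0.0, 0.0],      # made-up one-hot targets
              [0.0, 1.0, 0.0]])
F = np.array([[0.8, 0.1, 0.2],      # made-up network outputs f(w, x_j)
              [0.3, 0.6, 0.1]])

m = Y.shape[0]
L = (0.5/m) * np.sum((F - Y)**2)
print(L)   # 0.25 * (0.04+0.01+0.04 + 0.09+0.16+0.01) = 0.0875
```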

Output layer

The partial derivative of the loss with respect to each weight gives the gradient; for a single training sample $j$, in simplified form it is:

$$ \Delta w_{i,k,j} = - (y_k-f_{k,j}) \cdot f_{k,j} \cdot (1-f_{k,j}) \cdot x_{i,j}$$

We can then sum the gradients over all $m$ training samples:

$$ \Delta w_{i,k} = - \sum_{j=1}^m (y_k-f_{k,j}) \cdot f_{k,j} \cdot (1-f_{k,j}) \cdot x_{i,j}$$

In shorter form, we group the inner part under a new variable "delta", absorbing the leading minus sign: $\delta_{k,j} = (f_{k,j}-y_k) \cdot f_{k,j} \cdot (1-f_{k,j})$, so that:

$$ \Delta w_{i,k} = \sum_{j=1}^m \delta_{k,j} \cdot x_{i,j}$$

Then we can update $w$ (with the learning rate $\eta$):

$$ w_{i,k}(t+1) = w_{i,k}(t) - \eta \sum_{j=1}^m \delta_{k,j} \cdot x_{i,j}$$

Specifically, for the weights from the hidden to the output layer ($w^{out}$), the input "$x$" is actually the output of the layer below, $f^{hid}$.

w hidden-to-output layer update:
$$ w_{i,k}^{out}(t+1) = w_{i,k}^{out}(t) - \eta \sum_{j=1}^m \delta_{k,j}^{out} \cdot f_{i,j}^{hid}$$
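In NumPy the whole sum over samples collapses into a single matrix product. Here is a minimal standalone sketch with tiny made-up arrays (the training cell further down does the same thing, using np.einsum for the product):

```python
import numpy as np

rng = np.random.RandomState(0)
m, n_hid, n_out = 5, 4, 3                        # tiny made-up sizes
f_hid = rng.rand(m, n_hid)                       # hidden outputs, shape (samples, hidden nodes)
f_out = rng.rand(m, n_out)                       # network outputs, shape (samples, output nodes)
Y     = np.eye(n_out)[rng.randint(0, n_out, m)]  # made-up one-hot targets
w_out = rng.rand(n_out, n_hid + 1) - 0.5         # output weights incl. bias column at index 0
eta   = 0.1                                      # hypothetical learning rate

# delta_{k,j} = (f - y) * f * (1 - f), element-wise per sample and output node
delta_out = (f_out - Y) * f_out * (1 - f_out)

# summing delta * input over all samples is one matrix product -> (output nodes, hidden nodes)
wgrad_out = delta_out.T.dot(f_hid)               # same as np.einsum('ki,kj->ij', delta_out, f_hid)

# gradient-descent update; the bias column sees a constant input of 1
w_out[:, 1:] -= eta * wgrad_out
w_out[:, 0]  -= eta * delta_out.sum(axis=0)
```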

Intuition behind the formula

For one training sample, the correction we apply to a weight (that is, $-\Delta w_{i,k,j}$, leaving out the learning rate) is the product of three parts:

$$ \begin{align} A &= (y_k-f_{k,j}^{out}) \\ B &= f_{k,j}^{out} \cdot (1-f_{k,j}^{out}) \\ C &= f_{i,j}^{hid} \end{align} $$
  • A is the signed error. Its value lies between -1 and +1; the bigger the error, the bigger the correction.
  • B is the magnitude of "uncertainty". B's value ranges from 0 to 0.25 with a dome-like shape (see the plot below). The size of the correction therefore varies with how confident the previous prediction ($f^{out}$) was: at $f^{out} = 0.5$ the truth could equally be 0 or 1, so it warrants a large correction.
  • C is the magnitude of the input. Its value is between 0 and 1; multiplying by it scales the correction with the input.


In [207]:
# Graphing B part's formula
graph_x = np.arange(0,1.05,0.05)
graph_y = graph_x*(1-graph_x)
plt.title('B Part value range')
plt.xlabel('f_out')
plt.ylabel('B value')
plt.plot(graph_x, graph_y)
plt.ylim([0,0.30])
plt.show()



Hidden layer

Moving on to the lower layer -- from input to hidden.

Here comes the backpropagation step -- summing the corrections back from the output layer:

$$ \Delta w_{i,k}^{hid} = \sum_{j=1}^m \bigl\{ \bigl(\sum_{k_{out}=1}^n \delta_{k_{out},j}^{out} \cdot w_{k,k_{out}}^{out}(t) \bigr) \cdot f_{k,j}^{hid} \cdot (1-f_{k,j}^{hid}) \cdot x_{i,j} \bigr\}$$

Again, we can shorten it by grouping everything except $x_{i,j}$ under $\delta_{k,j}^{hid}$:

$$ \Delta w_{i,k}^{hid} = \sum_{j=1}^m \delta_{k,j}^{hid} \cdot x_{i,j} $$

And we can update w as such:

w input-to-hidden layer update:
$$ w_{i,k}^{hid}(t+1) = w_{i,k}^{hid}(t) - \eta \sum_{j=1}^m \delta_{k,j}^{hid} \cdot x_{i,j}$$
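The same vectorized pattern works for the hidden layer. A standalone sketch with made-up arrays (the real training loop below does this too, reusing w_out_bef, the output weights saved before their own update):

```python
import numpy as np

rng = np.random.RandomState(1)
m, n_in, n_hid, n_out = 5, 6, 4, 3                            # tiny made-up sizes
Xb        = np.hstack([np.ones((m, 1)), rng.rand(m, n_in)])   # inputs with a bias column of 1s
f_hid     = rng.rand(m, n_hid)                                # hidden outputs (samples, hidden nodes)
delta_out = rng.rand(m, n_out) - 0.5                          # output-layer deltas from the previous step
w_out_bef = rng.rand(n_out, n_hid + 1) - 0.5                  # output weights *before* their update
w_hid     = rng.rand(n_hid, n_in + 1) - 0.5                   # input-to-hidden weights incl. bias column
eta       = 0.1                                               # hypothetical learning rate

# backpropagate: sum each output delta back through its outgoing weight, then apply f*(1-f)
delta_hid = delta_out.dot(w_out_bef[:, 1:]) * f_hid * (1 - f_hid)   # shape (samples, hidden nodes)

# sum delta * input over the samples, then the gradient-descent update
wgrad_hid = delta_hid.T.dot(Xb[:, 1:])
w_hid[:, 1:] -= eta * wgrad_hid
w_hid[:, 0]  -= eta * delta_hid.sum(axis=0)
```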

Intuition behind the formula, part deux

We continue the analysis of the gradient ($\Delta w$), this time for the hidden layer:

$$ \begin{align} D &= \sum_{k_{out}=1}^n \delta_{k_{out},j}^{out} \cdot w_{k,k_{out}}^{out}(t) \\ E &= f_{k,j}^{hid} \cdot (1-f_{k,j}^{hid}) \\ F &= x_{i,j} \end{align} $$
  • D is the signed error. It sums the output layer's delta values back into hidden node $k$, each scaled by its outgoing weight.
  • E is the magnitude of "uncertainty"; it works the same way as B above.
  • F is the magnitude of the input; it works the same way as C above. The different input features should be standardized so that their magnitudes are on a similar scale (see the short example below).
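For example, the pixel intensities here run from 0 to 16, and the data-preparation cell below rescales them with (X-8)/8 so every feature lands in the same [-1, 1] range:

```python
import numpy as np

X_raw = np.array([0.0, 4.0, 8.0, 16.0])   # a few raw pixel intensities (0-16)
print((X_raw - 8)/8)                      # [-1.  -0.5  0.   1. ]
```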

For someone who hates math equations

All the math notation actually made my head hurt. (And I wrote it myself!)

For a step-by-step calculation with actual numbers, this website has a really good write-up: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/


Optimization method

We are simply going to sum up the deltas from all the samples (all 1,797 of them) and update the weights in one go, i.e. full-batch gradient descent. Since it's a matrix operation, it is pretty fast.

I considered mini-batches and other tricks, but for this notebook I just want a clear, step-by-step example of the algorithm itself. (A sketch of what mini-batching would look like is shown below.)
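For reference, a mini-batch variant would only change the bookkeeping: shuffle the row indices each epoch and apply the same forward/backward updates to one slice at a time. A minimal sketch of that batching skeleton (hypothetical batch_size and epoch count, not used in this notebook):

```python
import numpy as np

sample_size = 1797                                  # as in this dataset
batch_size = 128                                    # hypothetical mini-batch size
n_epochs = 5                                        # hypothetical, just to show the loop structure

indices = np.arange(sample_size)
for epoch in range(n_epochs):
    np.random.shuffle(indices)                      # new sample order every epoch
    for start in range(0, sample_size, batch_size):
        batch = indices[start:start + batch_size]   # row indices for this mini-batch
        # the forward pass, deltas and weight updates would go here,
        # exactly as in the training cell below but using Xb[batch] and Y[batch]
```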




In [598]:
#set size of input, features, hidden, target
sample_size = digits.images.shape[0]
feature_size = digits.images.shape[1]*digits.images.shape[2]
target_size = 10
hidden_size = 15

#one-hot encode the targets: a row of 10 zeros with a 1 at the target digit
Y = np.zeros((sample_size,target_size))
for j in range(0,sample_size):
    Y[j][digits.target[j]] = 1

#flatten each 8x8 image into a row of 64 input features
X = digits.images[0:sample_size].reshape(sample_size,feature_size)
X = (X-8)/8 #scale pixel values from 0..16 into [-1,1]
Xb = np.insert(X,0,1,axis=1) #add bias input (a constant 1), always activated

In [599]:
def sigmoid(w,X):
    #w: (nodes, inputs+1), X: (samples, inputs+1) --> returns (samples, nodes)
    a = 1.0/(1.0 + np.exp(-w.dot(X.transpose())))
    return a.transpose()

def loss_func(Y,y_pred):
    return (0.5/sample_size)*np.sum((Y-y_pred)**2) #element-wise operation then aggregate

In [600]:
#initialize the rest of the terms
# for weights --> index = (output node , input node)
w_hid = (np.random.rand(hidden_size,feature_size+1)-0.5) #randomized, and don't forget the bias!
w_out = (np.random.rand(target_size,hidden_size+1)-0.5) #randomized, and don't forget the bias!

#for f --> index = (data row , node)
f_hid = np.random.rand(sample_size,hidden_size)
f_out = np.random.rand(sample_size,target_size)

#for deltas  --> index = (data row , node)
delta_hid = np.random.rand(sample_size,hidden_size)
delta_out = np.random.rand(sample_size,target_size)

In [601]:
#verification with dummy data
#checking numbers from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
#sample_size = 1
#X = np.array([[0.05,0.1]])
#Xb = np.array([[1, 0.05,0.1]])
#Y = np.array([[0.01,0.99]])
#w_hid = np.array([[0.35,0.15,0.20],[0.35,0.25,0.3]])
#w_out = np.array([[0.60,0.40,0.45],[0.60,0.50,0.55]])
#w_out_bef = w_out.copy()

In [602]:
#run configuration
max_epoch = 5000
min_loss_criterion = 10**-4

#doing 1st forward pass to calculate loss
f_hid = sigmoid(w_hid,Xb)
f_hid_b = np.insert(f_hid,0,1,axis=1) #bias activation for next layer
f_out = sigmoid(w_out,f_hid_b)
curr_loss = loss_func(Y,f_out)
loss = []
loss.append(curr_loss)
print 'start_loss = {}'.format(curr_loss*2)

learning_rate = 0.7/sample_size
learning_rate_bias = 0.7/sample_size

for i in range(0,max_epoch):
    
    #update the weights of output layer
    delta_out = (f_out - Y)*(f_out)*(1-f_out) #element-wise operation 
    wgrad_out = np.einsum('ki,kj->ij', delta_out, f_hid) #dot operation already sums it up     
    w_out_bef = w_out.copy()
    w_out[:,1:] = w_out[:,1:] -learning_rate*(wgrad_out)
    w_out[:,0] = w_out[:,0] -learning_rate_bias*np.sum(delta_out,axis=0)*1.0

    #update the weights of hidden layer
    delta_hid = delta_out.dot(w_out_bef[:,1:])*(f_hid)*(1-f_hid) #dot then element-wise operation      
    wgrad_hid = np.einsum('ki,kj->ij',delta_hid,Xb[:,1:])
    w_hid[:,1:] = w_hid[:,1:] -learning_rate*wgrad_hid
    w_hid[:,0] = w_hid[:,0] -learning_rate_bias*np.sum(delta_hid,axis=0)*1.0
    
    #re-calculate loss
    f_hid = sigmoid(w_hid,Xb)
    f_hid_b = np.insert(f_hid,0,1,axis=1) #bias activation for next layer
    f_out = sigmoid(w_out,f_hid_b)
    curr_loss = loss_func(Y,f_out)
    loss.append(curr_loss)

    #stopping criterion
    if (i>10) and ((loss[-2] - curr_loss) < min_loss_criterion): 
        print 'stop at {}'.format(i)
        break

print 'end_loss = {}'.format(loss[-1])
        
plt.figure()
plt.xlabel('no. of run')
plt.ylabel('loss function')
sns.tsplot(loss)


start_loss = 2.78434781239
stop at 941
end_loss = 0.0849270706587
Out[602]:
<matplotlib.axes._subplots.AxesSubplot at 0x3933f2e8>

In [605]:
#get the prediction to compare with target
y_pred = np.argmax(f_out,axis=1)

from sklearn.metrics import confusion_matrix
cm_mat = confusion_matrix(digits.target[0:sample_size],y_pred)
print cm_mat.T
accuracy = np.trace(cm_mat)*100.0/sample_size
print 'Accuracy = {:.2f}%'.format(accuracy)
print 'Actually this is not true accuracy because we didn\'t verify it with the test dataset.'
df_temp = pd.DataFrame(cm_mat.flatten()[np.newaxis].T,columns = ['values'])
plt.figure(figsize = (6,4),dpi=600)
sns.heatmap(cm_mat.T, cbar=True ,annot=True, fmt=',.0f')
plt.title('Confusion Matrix')
plt.xlabel('Truth')
plt.ylabel('Predicted')


[[177   0   0   0   0   1   1   0   0   0]
 [  0 170   0   0   2   1   1   0  15   2]
 [  0   1 176   3   0   0   0   0   2   0]
 [  0   0   0 177   0   0   0   0   2   1]
 [  1   0   0   0 176   0   0   1   0   0]
 [  0   2   0   2   0 176   0   0   4   3]
 [  0   2   0   0   0   1 179   0   3   0]
 [  0   0   1   0   1   1   0 177   1   5]
 [  0   3   0   1   1   0   0   1 145   3]
 [  0   4   0   0   1   2   0   0   2 166]]
Accuracy = 95.66%
Actually this is not true accuracy because we didn't verify it with the test dataset.
Out[605]:
<matplotlib.text.Text at 0x39b21128>
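For a more honest estimate, a portion of the rows would be held out before training; the weights would be fit on the training rows only and the confusion matrix computed on the held-out rows. A minimal sketch of such a split, reusing Xb, Y and sample_size from above (plain NumPy, no extra dependencies):

```python
import numpy as np

rng = np.random.RandomState(0)
perm = rng.permutation(sample_size)               # shuffle all row indices once
n_test = sample_size // 4                         # hold out ~25% of the data
test_idx, train_idx = perm[:n_test], perm[n_test:]

Xb_train, Y_train = Xb[train_idx], Y[train_idx]   # fit the weights on these rows only
Xb_test,  Y_test  = Xb[test_idx],  Y[test_idx]    # report accuracy on these rows
target_test = digits.target[test_idx]             # ground-truth digits for the test confusion matrix
```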

In [ ]: