In [1]:
import pandas as pd
import numpy as np
from random import randint
from pandas import Series,DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import math
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
In [2]:
%%html
<style>
table {float:left}
</style>
Source: http://yann.lecun.com/exdb/mnist/
sklearn conveniently bundles a small 8x8 version of it for us:
Classes | 10 |
---|---|
Samples per class | ~180 |
Samples total | 1797 |
Dimensionality | 64 |
Features | integers 0-16 |
In [3]:
from sklearn.datasets import load_digits
digits = load_digits(n_class=10)
print(type(digits))
import random
digits_sample = random.sample(range(0,digits.images.shape[0]),10)
print(digits_sample)
#show sample digits
plt.rcParams['figure.figsize'] = (12, 4)
f, axarr = plt.subplots(2, 5)
for j in range(0,axarr.shape[1]):
    for i in range(0,axarr.shape[0]):
        axarr[i,j].matshow(digits.images[digits_sample[i*axarr.shape[1]+j]]) #index i*5+j so all 10 sampled digits are shown
plt.show()
Source: https://en.wikipedia.org/wiki/Feedforward_neural_network
Source2: http://neuralnetworksanddeeplearning.com/chap1.html
The network structure is a feedforward net like the ones in the sources above: an input layer, one hidden layer, and an output layer.
In our case, each image has a total of 8x8 = 64 pixels, so we will use 64 input neurons instead of 784. We can try using the same number of hidden neurons (15).
(Important Note: since each node in a layer receives exactly the same input, it is of utmost importance to initialize all the weights with different (randomized) values! Otherwise the outputs will stay identical for all eternity. See the quick sketch below.)
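As a quick illustration of that note, here is a minimal sketch with made-up dummy data (not part of the digits model): two neurons initialized with identical weights produce identical outputs, so they would also receive identical gradient updates and never differentiate, while randomized weights let them diverge.
In [ ]:
#symmetry-breaking demo with dummy data
x_demo = np.random.rand(5)          #a dummy 5-feature input
w_same = np.ones((2,5))*0.3         #two neurons with identical weights
w_rand = np.random.rand(2,5)-0.5    #two neurons with randomized weights
act_same = 1.0/(1.0 + np.exp(-w_same.dot(x_demo)))
act_rand = 1.0/(1.0 + np.exp(-w_rand.dot(x_demo)))
print(act_same)   #identical outputs -> identical gradients -> the two neurons stay the same
print(act_rand)   #different outputs -> the neurons can learn different features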
Let's define some terms.
symbol | meaning |
---|---|
i | index of input to the neurons |
k | index of the neurons (per layer) |
j | index of data input rows (each digit image) |
w | weight vector for neuron connections |
$w_{i,k}$ | weight connection from i input to k neuron |
x or X | vector or array of input data |
y or Y | vector or array of target data (0 or 1 for each digit) |
f or f(x,..) | activation function, in this case it's a sigmoid. |
$\eta$ | learning rate (pronounced "eta") |
This is a 10-digit multi-class classification, where each output node's probability is given by a sigmoid function:
$$ f(w,X) = \dfrac{1}{1+e^{-(w \cdot X)}} $$
(Note: the weight vector $w$ includes the bias component as $w_0$; the input array $X$ always has 1 as its first element, so the bias is always activated.)
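A tiny numeric sketch of this bias trick, using made-up numbers unrelated to the digits data:
In [ ]:
#bias trick demo with dummy numbers
w_demo = np.array([0.5, -1.0, 2.0])   #w_0 = 0.5 is the bias weight
x_demo = np.array([1.0, 0.3, 0.7])    #first element is always 1, so w_0 is always applied
print(1.0/(1.0 + np.exp(-w_demo.dot(x_demo))))   #output lies in (0,1)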
The loss function is the mean squared error (MSE):
$$ L = \dfrac{1}{2m}\sum_{j=1}^m \bigl\|\;f(w,x_j) - y_j\;\bigr\|^2 $$
The first partial derivative with respect to each weight is the gradient; for a single training sample $j$, in simplified form, it is:
$$ \Delta w_{i,k,j} = - (y_k-f_{k,j}) \cdot f_{k,j} \cdot (1-f_{k,j}) \cdot x_{i,j}$$
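For reference, this result is just the chain rule applied to a single sample's contribution $L_j = \tfrac{1}{2}\bigl\|f(w,x_j) - y_j\bigr\|^2$, using the sigmoid derivative $f' = f \cdot (1-f)$ (note that $(f_{k,j}-y_k) = -(y_k-f_{k,j})$):
$$ \frac{\partial L_j}{\partial w_{i,k}} = \underbrace{(f_{k,j}-y_k)}_{\partial L_j / \partial f_{k,j}} \cdot \underbrace{f_{k,j} \cdot (1-f_{k,j})}_{\partial f_{k,j} / \partial (w \cdot x_j)} \cdot \underbrace{x_{i,j}}_{\partial (w \cdot x_j) / \partial w_{i,k}} $$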
We can sum the gradients of all the training samples like so:
$$ \Delta w_{i,k} = - \sum_{j=1}^m (y_k-f_{k,j}) \cdot f_{k,j} \cdot (1-f_{k,j}) \cdot x_{i,j}$$
In shorter form, we group the inner part under a new variable "delta", $\delta_{k,j} = -(y_k-f_{k,j}) \cdot f_{k,j} \cdot (1-f_{k,j})$:
$$ \Delta w_{i,k} = \sum_{j=1}^m \delta_{k,j} \cdot x_{i,j}$$
Then we can update $w$ with the learning rate (the $\tfrac{1}{m}$ factor from the loss is folded into $\eta$; the code below uses learning_rate = 0.7/sample_size):
$$ w_{i,k}(t+1) = w_{i,k}(t) - \eta \sum_{j=1}^m \delta_{k,j} \cdot x_{i,j}$$
Specifically, for the weights from the hidden to the output layer ($w^{out}$), the "input" $x$ is actually the output of the layer below, $f^{hid}$.
The update gradient ($\Delta w$) itself, excluding the learning rate, is (up to sign) the product of 3 parts:
$$ \begin{align} A &= (y_k-f_{k,j}^{out}) \\ B &= f_{k,j}^{out} \cdot (1-f_{k,j}^{out}) \\ C &= f_{i,j}^{hid} \end{align} $$
In [207]:
# Graphing B part's formula
graph_x = np.arange(0,1.05,0.05)
graph_y = graph_x*(1-graph_x)
plt.title('B Part value range')
plt.xlabel('f_out')
plt.ylabel('B value')
plt.plot(graph_x, graph_y)
plt.ylim([0,0.30])
plt.show()
Moving on to the lower layer, from input to hidden.
Here comes the backpropagation step, where we sum the corrections back from the output layer:
$$ \Delta w_{i,k}^{hid} = \sum_{j=1}^m \bigl\{ \bigl(\sum_{k_{out}=1}^n \delta_{k_{out},j}^{out} \cdot w_{k_{hid},k_{out}}^{out}(t) \bigr) \cdot f_{k,j}^{hid} \cdot (1-f_{k,j}^{hid}) \cdot x_{i,j} \bigr\}$$
Again, we can shorten it by grouping everything except $x_{i,j}$ into $\delta_{k,j}^{hid}$:
$$ \Delta w_{i,k}^{hid} = \sum_{j=1}^m \delta_{k,j}^{hid} \cdot x_{i,j} $$
And we can update $w^{hid}$ in the same way:
$$ w_{i,k}^{hid}(t+1) = w_{i,k}^{hid}(t) - \eta \sum_{j=1}^m \delta_{k,j}^{hid} \cdot x_{i,j} $$
We continue the analysis of the gradient ($\Delta w$), this time for the hidden layer:
$$ \begin{align} D &= \sum_{k_{out}=1}^n \delta_{k_{out},j}^{out} \cdot w_{k_{hid},k_{out}}^{out}(t) \\ E &= f_{k,j}^{hid} \cdot (1-f_{k,j}^{hid}) \\ F &= x_{i,j} \end{align} $$
All these math notations actually made my head hurt. (And I've written them myself!)
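To make the index bookkeeping concrete, here is a minimal single-sample sketch with made-up sizes (3 inputs, 2 hidden neurons, 2 outputs, biases left out) that computes the D, E, F factors above and the resulting hidden-layer gradient:
In [ ]:
#single-sample backprop sketch with dummy sizes (no biases)
x_s = np.array([0.5, -0.2, 0.1])                    #one sample, 3 inputs
y_s = np.array([1.0, 0.0])                          #targets for 2 output nodes
w_hid_s = np.random.rand(2,3)-0.5                   #hidden weights: (2 hidden x 3 inputs)
w_out_s = np.random.rand(2,2)-0.5                   #output weights: (2 outputs x 2 hidden)
f_hid_s = 1.0/(1.0+np.exp(-w_hid_s.dot(x_s)))       #hidden activations, shape (2,)
f_out_s = 1.0/(1.0+np.exp(-w_out_s.dot(f_hid_s)))   #output activations, shape (2,)
delta_out_s = (f_out_s - y_s)*f_out_s*(1-f_out_s)   #output-layer delta
D = w_out_s.T.dot(delta_out_s)                      #deltas summed back through w_out, shape (2,)
E = f_hid_s*(1-f_hid_s)                             #sigmoid derivative at the hidden layer
F = x_s                                             #the raw inputs
wgrad_hid_s = np.outer(D*E, F)                      #hidden-layer gradient, shape (2,3) = shape of w_hid_s
print(wgrad_hid_s.shape)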
For a step-by-step calculation with actual numbers, this website has a really good write-up: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
We are just going to sum up the deltas from all the samples (all 1,797 of them) and update the weights in one go. Since it's a single matrix operation, the speed is pretty fast (see the quick check below).
I was considering doing mini-batches and other tricks, but for this Notebook I just want a clear, step-by-step example of the actual algorithm itself.
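As a quick check that the "sum over all samples" really is one matrix operation, here is a sketch with dummy values (the shapes are the ones used in this notebook: 1797 samples, 10 output nodes, 15 hidden nodes) comparing a per-sample loop against the einsum call used in the training loop below:
In [ ]:
#full-batch gradient as one matrix operation (dummy values, real shapes)
m_demo, n_out_demo, n_hid_demo = 1797, 10, 15
delta_demo = np.random.rand(m_demo, n_out_demo)   #per-sample output deltas
fhid_demo = np.random.rand(m_demo, n_hid_demo)    #per-sample hidden activations
grad_loop = np.zeros((n_out_demo, n_hid_demo))
for jj in range(m_demo):                          #slow: accumulate one outer product per sample
    grad_loop += np.outer(delta_demo[jj], fhid_demo[jj])
grad_vec = np.einsum('ki,kj->ij', delta_demo, fhid_demo)   #fast: sum over the sample axis k in one call
print(np.allclose(grad_loop, grad_vec))           #True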
In [598]:
#set size of input, features, hidden, target
sample_size = digits.images.shape[0]
feature_size = digits.images.shape[1]*digits.images.shape[2]
target_size = 10
hidden_size = 15
#make a flat 10 output with all zeros
Y = np.zeros((sample_size,10))
for j in range(0,sample_size):
    Y[j][digits.target[j]] = 1
#make a row of 64 input features instead of 8x8
X = digits.images[0:sample_size].reshape(sample_size,feature_size)
X = (X-8)/8 #normalized
Xb = np.insert(X,0,1,axis=1) #add bias input, always activated
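A quick shape check on the arrays built above (the expected shapes follow from 1,797 samples, 64 pixel features plus one bias column, and 10 one-hot targets):
In [ ]:
#sanity check of the prepared arrays
print(Xb.shape)   #expected (1797, 65): 64 features + the bias column of ones
print(Y.shape)    #expected (1797, 10): one-hot targets
print(Y[0], digits.target[0])   #the one-hot row should match the first target digit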
In [599]:
def sigmoid(w,X):
    a = 1.0/(1.0 + np.exp(-w.dot(X.transpose())))
    return a.transpose()
def loss_func(Y,y_pred):
    return (0.5/sample_size)*np.sum((Y-y_pred)**2) #element-wise operation then aggregate
In [600]:
#initialize the rest of the terms
# for weights --> index = (output node , input node)
w_hid = (np.random.rand(hidden_size,feature_size+1)-0.5) #randomized, and don't forget the bias!
w_out = (np.random.rand(target_size,hidden_size+1)-0.5) #randomized, and don't forget the bias!
#for f --> index = (data row , node)
f_hid = np.random.rand(sample_size,hidden_size)
f_out = np.random.rand(sample_size,target_size)
#for deltas --> index = (data row , node)
delta_hid = np.random.rand(sample_size,hidden_size)
delta_out = np.random.rand(sample_size,target_size)
In [601]:
#verification with dummy data
#checking numbers from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
#sample_size = 1
#X = np.array([[0.05,0.1]])
#Xb = np.array([[1, 0.05,0.1]])
#Y = np.array([[0.01,0.99]])
#w_hid = np.array([[0.35,0.15,0.20],[0.35,0.25,0.3]])
#w_out = np.array([[0.60,0.40,0.45],[0.60,0.50,0.55]])
#w_out_bef = w_out.copy()
In [602]:
#run configuration
max_epoch = 5000
min_loss_criterion = 10**-4
#doing 1st forward pass to calculate loss
f_hid = sigmoid(w_hid,Xb)
f_hid_b = np.insert(f_hid,0,1,axis=1) #bias activation for next layer
f_out = sigmoid(w_out,f_hid_b)
curr_loss = loss_func(Y,f_out)
loss = []
loss.append(curr_loss)
print('start_loss = {}'.format(curr_loss*2))
learning_rate = 0.7/sample_size
learning_rate_bias = 0.7/sample_size
for i in range(0,max_epoch):
    #update the weights of output layer
    delta_out = (f_out - Y)*(f_out)*(1-f_out) #element-wise operation
    wgrad_out = np.einsum('ki,kj->ij', delta_out, f_hid) #dot operation already sums it up
    w_out_bef = w_out.copy()
    w_out[:,1:] = w_out[:,1:] -learning_rate*(wgrad_out)
    w_out[:,0] = w_out[:,0] -learning_rate_bias*np.sum(delta_out,axis=0)*1.0
    #update the weights of hidden layer
    delta_hid = delta_out.dot(w_out_bef[:,1:])*(f_hid)*(1-f_hid) #dot then element-wise operation
    wgrad_hid = np.einsum('ki,kj->ij',delta_hid,Xb[:,1:])
    w_hid[:,1:] = w_hid[:,1:] -learning_rate*wgrad_hid
    w_hid[:,0] = w_hid[:,0] -learning_rate_bias*np.sum(delta_hid,axis=0)*1.0
    #re-calculate loss
    f_hid = sigmoid(w_hid,Xb)
    f_hid_b = np.insert(f_hid,0,1,axis=1) #bias activation for next layer
    f_out = sigmoid(w_out,f_hid_b)
    curr_loss = loss_func(Y,f_out)
    loss.append(curr_loss)
    #stopping criterion
    if (i>10) and ((loss[-2] - curr_loss) < min_loss_criterion):
        print('stop at {}'.format(i))
        break
print('end_loss = {}'.format(loss[-1]))
plt.figure()
plt.xlabel('no. of run')
plt.ylabel('loss function')
sns.tsplot(loss)
Out[602]:
In [605]:
#get the prediction to compare with target
y_pred = np.argmax(f_out,axis=1)
from sklearn.metrics import confusion_matrix
cm_mat = confusion_matrix(digits.target[0:sample_size],y_pred)
print(cm_mat.T)
accuracy = np.trace(cm_mat)*100.0/sample_size
print('Accuracy = {:.2f}%'.format(accuracy))
print('Actually this is not true accuracy because we didn\'t verify it with a test dataset.')
df_temp = pd.DataFrame(cm_mat.flatten()[np.newaxis].T,columns = ['values'])
plt.figure(figsize = (6,4),dpi=600)
sns.heatmap(cm_mat.T, cbar=True ,annot=True, fmt=',.0f')
plt.title('Confusion Matrix')
plt.xlabel('Truth')
plt.ylabel('Predicted')
Out[605]: