In biological neural nets, learning occurs via electro-chemical changes in the effectiveness of the synaptic junctions.
In artificial neural nets (ANNs), learning occurs via changes in the values of the connection weights.
ANNs thus incorporate the two fundamental components of biological neural nets: neurons (nodes) and synapses (weighted connections).
A neural network consists of the following components: an input layer x, one or more hidden layers, an output layer ŷ, a set of weights and biases between the layers (W, b), and an activation function σ for each hidden layer.
In [100]:
1*0.25 + 0.5*(-1.5)  # weighted sum of two inputs with their connection weights
Out[100]:
-0.5
Each iteration of the training process consists of the following steps:
feedforward: calculating the predicted output ŷ
backpropagation: updating the weights and biases
Both steps depend on the choice of activation function for each hidden layer, σ.
The output ŷ of a simple 2-layer Neural Network is:
$$ \widehat{y} = \sigma (w_2 z + b_2) = \sigma(w_2 \sigma(w_1 x + b_1) + b_2)$$
Loss or cost function
$$ C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 $$
Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x.
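To make the two formulas concrete, here is a minimal NumPy sketch (not part of the original notebook) that computes ŷ for each input and evaluates the quadratic cost; the shapes and values below are illustrative assumptions, not the dimensions used later in the notebook.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
# illustrative shapes: 3 inputs -> 4 hidden units -> 2 outputs
w1, b1 = rng.standard_normal((3, 4)), np.zeros(4)
w2, b2 = rng.standard_normal((4, 2)), np.zeros(2)

def predict(x):
    z = sigmoid(x @ w1 + b1)        # hidden layer: z = sigma(w1 x + b1)
    return sigmoid(z @ w2 + b2)     # output: y_hat = sigma(w2 z + b2)

# quadratic cost C(w, b) = 1/(2n) * sum_x ||y(x) - a||^2
X = rng.standard_normal((5, 3))     # n = 5 training inputs
Y = rng.random((5, 2))              # their target vectors y(x)
C = np.sum((Y - np.array([predict(x) for x in X]))**2) / (2 * len(X))
print(C)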
The chain rule lets us calculate the derivative of the loss function with respect to the weights.
Note that for simplicity, the partial derivative below is written for a 1-layer neural network.
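The expression being referred to (reconstructed here from the surrounding definitions, since the original figure is not reproduced) is, for a 1-layer network with $z = Wx + b$, $\widehat{y} = \sigma(z)$ and $\mathrm{Loss} = \sum (y - \widehat{y})^2$:

$$ \frac{\partial \mathrm{Loss}(y, \widehat{y})}{\partial W} = \frac{\partial \mathrm{Loss}}{\partial \widehat{y}} \cdot \frac{\partial \widehat{y}}{\partial z} \cdot \frac{\partial z}{\partial W} = 2\,(\widehat{y} - y)\cdot \sigma'(z) \cdot x $$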
Understanding the Mathematics behind Gradient Descent
A simple mathematical intuition behind one of the commonly used optimisation algorithms in Machine Learning.
The network's prediction error tells us how to change the weights: we adjust each weight in the direction that reduces the error.
We know that $E = \sum_{j=1}^{n} \frac{1}{2} (t_j - o_j)^2$, where $t_j$ is the target value and $o_j$ the network's output.
Since $t_j$ is a constant, differentiating with respect to $o_j$ gives:
$\frac{\partial E}{\partial o_{j}} = -(t_j - o_j)$
Applying the chain rule to differentiate with respect to a weight $w_{ij}$:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \cdot \frac{\partial o_j}{\partial w_{ij}} = -(t_j - o_j) \cdot \frac{\partial o_j}{\partial w_{ij}}$
Further, we often use the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ as the activation function, whose derivative is $\frac{\partial \sigma(x)}{\partial x} = \sigma(x) \cdot (1 - \sigma(x))$.
Given $o_j = \sigma(\sum_{i=1}^{m} w_{ij}h_i)$, where $h_i$ is the output of hidden unit $i$, we have
$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \sigma(\sum_{i=1}^{m} w_{ij}h_i) \cdot (1 - \sigma(\sum_{i=1}^{m} w_{ij}h_i)) \cdot \frac{\partial }{\partial w_{ij}} \sum_{i=1}^{m} w_{ij}h_i$
$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \sigma(\sum_{i=1}^{m} w_{ij}h_i) \cdot (1 - \sigma(\sum_{i=1}^{m} w_{ij}h_i)) \cdot h_i$
The gradient-descent update $w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}$ then moves each weight in the direction that decreases $E$.
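As a quick sanity check (not in the original notebook), the closed-form gradient above can be compared against a finite-difference estimate for a single output unit; all names and values below are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
h = rng.standard_normal(5)        # hidden activations h_i, m = 5
w = rng.standard_normal(5)        # weights w_ij feeding one output unit j
t = 0.7                           # target t_j

def error(w):
    o = sigmoid(w @ h)            # o_j = sigma(sum_i w_ij h_i)
    return 0.5 * (t - o) ** 2

# analytic gradient: -(t - o) * o * (1 - o) * h_i
o = sigmoid(w @ h)
grad_analytic = -(t - o) * o * (1 - o) * h

# central finite-difference gradient
eps = 1e-6
grad_numeric = np.array([(error(w + eps * np.eye(5)[i]) - error(w - eps * np.eye(5)[i])) / (2 * eps)
                         for i in range(5)])
print(np.allclose(grad_analytic, grad_numeric))   # True: the two gradients agree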
Each image in the digits dataset has 8 × 8 = 64 pixels.
In [139]:
# Author: Robert Guthrie
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import sys
import matplotlib.cm as cm
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from collections import defaultdict
from matplotlib.collections import LineCollection
%matplotlib inline
from sklearn import datasets
from sklearn.manifold import Isomap
from sklearn.manifold import TSNE
from sklearn.manifold import MDS
from sklearn.decomposition import PCA
In [102]:
# basic functions
# softmax: subtract the maximum before exponentiating to avoid overflow (inf)
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

# row-wise softmax for a batch of score vectors
def softmaxByRow(x):
    e_x = np.exp(x - x.max(axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)

# flush print: overwrite the current line to report progress
def flushPrint(d):
    sys.stdout.write('\r')
    sys.stdout.write(str(d))
    sys.stdout.flush()
In [103]:
# the limits of np.exp
np.exp(1000)
Out[103]:
inf
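Because np.exp overflows for large arguments (as just shown), softmax and softmaxByRow subtract the maximum before exponentiating; since softmax(x) equals softmax(x - c) for any constant c, the result is unchanged. A small illustrative check, not part of the original notebook:

import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])
# naive softmax overflows: np.exp(1000) is inf, so the ratio becomes nan
naive = np.exp(x) / np.exp(x).sum()
# shifting by the max keeps every exponent <= 0, so no overflow occurs
shifted = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(naive)    # [nan nan nan]
print(shifted)  # well-defined probabilities that sum to 1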
In [104]:
# load data
digits = datasets.load_digits()
# display data
fig, ax = plt.subplots(5, 5, figsize=(5, 5))
for i, axi in enumerate(ax.flat):
    axi.imshow(digits.images[i], cmap='binary')
    axi.set(xticks=[], yticks=[])
In [105]:
# prepare training sets
N, H, D_in, D_out = 100, 50, 64, 10 # batch size, hidden, input, output dimension
k = 0.9 # fraction of the data used for training
learning_rate = 1e-6
L = len(digits.data)
l = int(L*k)
In [106]:
L, l
Out[106]:
In [107]:
Batches = {}
M = 200 # number of batches
for j in range(M):
    index = list(np.random.randint(l, size=N))  # randomly sample N data points from the training part
    y = np.zeros((N, 10))
    y[np.arange(N), list(digits.target[index])] = 1  # one-hot encode the sampled labels
    x = digits.data[index]
    Batches[j] = [x, y]
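The line y[np.arange(N), list(digits.target[index])] = 1 builds one-hot label vectors through NumPy fancy indexing: row i gets a 1 in the column given by its label. A tiny standalone illustration with made-up labels:

import numpy as np

labels = np.array([3, 0, 2])          # three example digit labels
onehot = np.zeros((3, 10))
onehot[np.arange(3), labels] = 1      # row i gets a 1 in column labels[i]
print(onehot[:, :4])
# [[0. 0. 0. 1.]
#  [1. 0. 0. 0.]
#  [0. 0. 1. 0.]]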
In [108]:
j = 7
x, y = Batches[j]
In [109]:
plt.imshow(x, cmap = 'binary') # 100*64
plt.show()
In [110]:
plt.imshow(y, cmap = 'binary') # 100*10
plt.show()
In [111]:
w1 = np.random.randn(D_in, H)/H
w2 = np.random.randn(H, D_out)/H
w1c = w1.copy() # keep copies for comparison in the visualization later
w2c = w2.copy()
In [112]:
plt.imshow(w1, cmap = 'binary') # 64*50
plt.title('w1', fontsize = 20)
plt.show()
In [113]:
plt.imshow(x.dot(w1), cmap = 'binary') # 100*50
plt.title('h', fontsize = 20)
plt.show()
ReLU activation function
In [114]:
h = x.dot(w1)
# relu activation
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
plt.imshow(y_pred, cmap = 'binary') # 100*10
plt.title('predicted_relu', fontsize = 20)
plt.show()
In [115]:
# softmax
y_predS=softmaxByRow(y_pred)
plt.imshow(y_predS, cmap = 'binary') # 100*10
plt.title('predicted_softmax', fontsize = 20)
plt.show()
In [116]:
plt.plot(y_pred[0], 'r-o')
plt.plot(y_predS[0], 'g-s')
plt.show()
In [119]:
Loss = defaultdict(lambda: [])
t = 0  # iteration index; in the full training loop below, t is the loop variable
loss = np.square(y_predS - y).sum()
Loss[j].append([t, loss])
In [122]:
Loss.items()
Out[122]:
Our goal in training is to find the best set of weights and biases that minimizes the loss function.
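Gradient descent does this iteratively: every parameter is moved a small step against its gradient, with the learning rate η (here learning_rate = 1e-6) controlling the step size. For the weight-only network used in this notebook the update is:

$$ w_1 \leftarrow w_1 - \eta \,\frac{\partial \mathrm{Loss}}{\partial w_1}, \qquad w_2 \leftarrow w_2 - \eta \,\frac{\partial \mathrm{Loss}}{\partial w_2} $$

which is exactly what the lines w1 -= learning_rate * grad_w1 and w2 -= learning_rate * grad_w2 below implement.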
In [118]:
# Backprop (note: the gradient is taken with respect to y_pred directly;
# the softmax Jacobian is not backpropagated here)
grad_y_pred = 2.0 * (y_predS - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0  # ReLU passes gradient only where the pre-activation was positive
grad_w1 = x.T.dot(grad_h)
# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
In [130]:
fig = plt.figure(figsize=(8, 3))
ax=fig.add_subplot(121)
plt.imshow(w1, cmap = 'binary') # 64*50
plt.title('w1_updated', fontsize = 20)
ax=fig.add_subplot(122)
plt.imshow(w2, cmap = 'binary') # 50*10
plt.title('w2_updated', fontsize = 20)
plt.tight_layout()
In [131]:
w1 = np.random.randn(D_in, H)/H
w2 = np.random.randn(H, D_out)/H
w1c = w1.copy() # keep copies for comparison in the visualization later
w2c = w2.copy()
Loss = defaultdict(lambda: [])
# training
for j in Batches:
    flushPrint(j)
    x, y = Batches[j]
    for t in range(500):  # repeated use of the same batch
        # Forward
        h = x.dot(w1)
        h_relu = np.maximum(h, 0)
        y_pred = h_relu.dot(w2)
        y_predS = softmaxByRow(y_pred)
        # loss
        loss = np.square(y_predS - y).sum()
        Loss[j].append([t, loss])
        # Backprop (softmax Jacobian omitted, as noted above)
        grad_y_pred = 2.0 * (y_predS - y)
        grad_w2 = h_relu.T.dot(grad_y_pred)
        grad_h_relu = grad_y_pred.dot(w2.T)
        grad_h = grad_h_relu.copy()
        grad_h[h < 0] = 0
        grad_w1 = x.T.dot(grad_h)
        # Update weights
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
In [132]:
fig = plt.figure(figsize=(8, 3))
ax=fig.add_subplot(121)
plt.imshow(w1, cmap = 'binary') # 64*50
plt.title('w1_updated', fontsize = 20)
ax=fig.add_subplot(122)
plt.imshow(w2, cmap = 'binary') # 50*10
plt.title('w2_updated', fontsize = 20)
plt.tight_layout()
In [133]:
# Display the decreasing loss
fig = plt.figure(figsize=(5, 4))
cmap = cm.get_cmap('rainbow', M)
for i in Loss:
    epochs, loss = zip(*sorted(Loss[i]))
    plt.plot(epochs, loss, color=cmap(i), alpha=0.7)
plt.xlabel('Epochs', fontsize=18)
plt.ylabel('Loss', fontsize=18)
ax1 = fig.add_axes([0.2, 0.8, 0.65, 0.03])
cb1 = mpl.colorbar.ColorbarBase(ax1, cmap=cmap,
                                norm=mpl.colors.Normalize(vmin=0, vmax=M),
                                orientation='horizontal')
cb1.set_label('N of batches')
In [134]:
# Test on the held-out part of the data
TestData = digits.data[-(L-l):]
PredictData = np.maximum(TestData.dot(w1), 0).dot(w2)   # forward pass (softmax not needed for argmax)
compare = np.argmax(PredictData, axis=1) - digits.target[-(L-l):]
Accuracy = list(compare).count(0)/float(len(compare))
Accuracy
Out[134]:
In [137]:
cmap=plt.cm.get_cmap('Accent', 10)
fig = plt.figure(figsize=(12, 4))
fig.add_subplot(141)
plt.imshow(w1c,cmap='Blues')
plt.title('w1 before training')
fig.add_subplot(142)
plt.imshow(w1,cmap='Blues')
plt.title('w1 after training')
fig.add_subplot(143)
plt.imshow(w2c,cmap='Blues')
plt.title('w2 before training')
fig.add_subplot(144)
plt.imshow(w2,cmap='Blues')
plt.title('w2 after training')
plt.tight_layout()
In [140]:
# dimension reduction for viz
pca = PCA(n_components=2)
projectionPixel = pca.fit_transform(w1) # 64*2
projectionLabel = pca.fit_transform(w2.T) # 10*2
In [145]:
G=nx.grid_2d_graph(8,8)
pos=dict(((i,j),(i,j)) for i,j in G.nodes())
index=sorted(pos.keys())
posPixel=dict(zip(index, projectionPixel))
In [179]:
fig = plt.figure(figsize=(16, 8))
# two examples of "3" and two examples of "4"
examples = [digits.images[3], digits.images[13], digits.images[4], digits.images[14]]
for k, img in enumerate(examples):
    # top row: the digit in its original 8x8 pixel grid
    ax = fig.add_subplot(2, 4, k + 1)
    x, y, z = zip(*[(i, j, img[i, j]) for i, j in index])
    plt.scatter(y, x, s=100, c=z, cmap='binary')
    plt.imshow(img, cmap='Blues')
    # bottom row: the same pixels placed at their 2-D PCA projection of w1
    ax = fig.add_subplot(2, 4, k + 5)
    line_segments = LineCollection([[posPixel[i], posPixel[j]] for i, j in G.edges()],
                                   color='gray', zorder=1)
    ax.add_collection(line_segments)
    x, y, z = zip(*[(projectionPixel[n][0], projectionPixel[n][1], img[xy[0], xy[1]])
                    for n, xy in enumerate(index)])
    plt.scatter(x, y, s=100, c=z, cmap='binary', zorder=2)
plt.tight_layout()
plt.show()
In [113]:
# dimension reduction for viz
pca = PCA(n_components=2)
iso = Isomap(n_components=2)
tsne = TSNE(n_components=2)
mds = MDS(n_components=2)
#
encodeData = digits.data.dot(w1)
projection0 = pca.fit_transform(digits.data)
projection1 = pca.fit_transform(encodeData)
projection2 = mds.fit_transform(digits.data)
projection3 = iso.fit_transform(digits.data)
projection4 = tsne.fit_transform(digits.data)
#
targ = np.zeros((len(digits.target), 10))
targ[np.arange(len(digits.target)), list(digits.target)] = 1
encodeTarget = w2.dot(targ.T).T
projection11 = pca.fit_transform(encodeTarget)
In [320]:
# viz
cmap = plt.cm.get_cmap('Accent', 10)
fig = plt.figure(figsize=(12, 8))

def viz(projection, ax, title):
    plt.scatter(projection[:, 0], projection[:, 1], lw=0, s=6, c=digits.target, cmap=cmap)
    #plt.colorbar(ticks=range(10), label='digit value')
    plt.title(title)
    ax.set_facecolor('black')   # set_axis_bgcolor was removed in newer matplotlib

ax = fig.add_subplot(233)
xs, ys = np.round(projection11, 3).T
for x, y, i in set(zip(xs, ys, digits.target)):
    plt.scatter(x, y, alpha=0)
    plt.text(x, y, str(i), color=cmap(i), size=18)
ax.set_facecolor('black')
plt.title('PCA encode label (50D)')

viz(projection0, fig.add_subplot(231), 'PCA raw data (64D)')
viz(projection1, fig.add_subplot(232), 'PCA encode data (50D)')
#viz(projection11, fig.add_subplot(233), 'PCA encode label (50D)')
viz(projection2, fig.add_subplot(234), 'MDS raw data (64D)')
viz(projection3, fig.add_subplot(235), 'Isomap raw data (64D)')
viz(projection4, fig.add_subplot(236), 'TSNE raw data (64D)')

plt.tight_layout()
In [84]:
# initialize
dtype = torch.FloatTensor
w1 = Variable(torch.randn(D_in, H).type(dtype)/H, requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype)/H, requires_grad=True)
learning_rate = 1e-6
Loss=defaultdict(lambda:[])
In [94]:
# train
for j in Batches:
    flushPrint(j)
    x, y = Batches[j]
    x = Variable(torch.from_numpy(x).type(dtype), requires_grad=False)
    y = Variable(torch.from_numpy(y).type(dtype), requires_grad=False)
    for t in range(500):
        y_pred = x.mm(w1).clamp(min=0).mm(w2)   # forward pass: ReLU hidden layer
        softmax = nn.Softmax(dim=1)
        y_soft = softmax(y_pred)
        loss = (y_soft - y).pow(2).sum()
        Loss[j].append([t, loss.data.item()])
        loss.backward()                          # autograd computes the gradients
        w1.data -= learning_rate * w1.grad.data
        w2.data -= learning_rate * w2.grad.data
        w1.grad.data.zero_()                     # reset gradients before the next step
        w2.grad.data.zero_()
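For reference, the same update can be expressed with torch.optim instead of modifying .data by hand; this is a minimal sketch under the same shapes, learning rate, and Batches dictionary, not part of the original notebook.

# sketch: the manual loop above, rewritten with an optimizer (assumes w1, w2, Batches, dtype exist)
optimizer = optim.SGD([w1, w2], lr=learning_rate)
softmax = nn.Softmax(dim=1)
for j in Batches:
    x = torch.from_numpy(Batches[j][0]).type(dtype)
    y = torch.from_numpy(Batches[j][1]).type(dtype)
    for t in range(500):
        loss = (softmax(x.mm(w1).clamp(min=0).mm(w2)) - y).pow(2).sum()
        optimizer.zero_grad()   # clear old gradients
        loss.backward()         # compute new gradients
        optimizer.step()        # w -= lr * grad for each parameter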
In [96]:
# Display the decreasing loss
fig = plt.figure(figsize=(5, 4))
cmap = cm.get_cmap('rainbow', M)
for i in Loss:
    epochs, loss = zip(*sorted(Loss[i]))
    plt.plot(epochs, loss, color=cmap(i), alpha=0.7)
plt.xlabel('Epochs', fontsize=18)
plt.ylabel('Loss', fontsize=18)
ax1 = fig.add_axes([0.2, 0.8, 0.65, 0.03])
cb1 = mpl.colorbar.ColorbarBase(ax1, cmap=cmap,
                                norm=mpl.colors.Normalize(vmin=0, vmax=M),
                                orientation='horizontal')
cb1.set_label('N of batches')
In [97]:
TestData=digits.data[-(L-l):]
xTest = Variable(torch.from_numpy(TestData).type(dtype), requires_grad=False)
PredictData = xTest.mm(w1).clamp(min=0).mm(w2)
compare=np.argmax(PredictData.data.numpy(),axis=1)-digits.target[-(L-l):]
Accuracy=list(compare).count(0)/float(len(compare))
Accuracy
Out[97]:
This is the end.