Intro to Neural Networks

Handwritten Digit Recognition



The Neuron: A Biological Information Processor

  • dendrites - the receivers
  • soma - neuron cell body (sums input signals)
  • axon - the transmitter
  • synapse - the point of transmission
  • neuron activates after a certain threshold is met

Learning occurs via electro-chemical changes in the effectiveness of the synaptic junctions.

An Artificial Neuron: The Perceptron (simulated in hardware or software)

  • input connections - the receivers
  • node, unit, or PE (processing element) - simulates the neuron body
  • output connection - the transmitter
  • activation function employs a threshold or bias
  • connection weights act as synaptic junctions

Learning occurs via changes in value of the connection weights.
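
As a minimal sketch (not part of the original notes), a single artificial neuron can be written directly in numpy; the weights, bias, and step activation are illustrative, and the numbers echo the weighted-sum example in the In [100] cell below.

import numpy as np

def artificial_neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by a threshold activation."""
    z = np.dot(w, x) + b           # the node sums the weighted input signals
    return 1 if z > 0 else 0       # step activation: fire only above the threshold

x = np.array([1.0, 0.5])           # input signals (the "dendrites")
w = np.array([0.25, -1.5])         # connection weights (the "synapses")
print(artificial_neuron(x, w, b=0.0))   # 0, since 1*0.25 + 0.5*(-1.5) = -0.5 <= 0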

Neural Networks consist of the following components

  • An input layer, x
  • An arbitrary number of hidden layers
  • An output layer, ŷ
  • A set of weights and biases between each layer, W and b
  • A choice of activation function for each hidden layer, σ.
    • e.g., Sigmoid activation function.

ANNs incorporate the two fundamental components of biological neural nets:

  • Neurons (nodes)
  • Synapses (weights)


In [100]:
1*0.25 + 0.5*(-1.5)


Out[100]:
-0.5

Each iteration of the training process consists of the following steps:

  1. Calculating the predicted output ŷ, known as feedforward
  2. Updating the weights and biases, known as backpropagation


The output ŷ of a simple 2-layer Neural Network is:

$$ \widehat{y} = \sigma (w_2 z + b_2) = \sigma(w_2 \sigma(w_1 x + b_1) + b_2)$$
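
A minimal numpy sketch of this forward pass (the sigmoid implementation and the toy dimensions are assumptions for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w1, b1, w2, b2):
    z = sigmoid(w1 @ x + b1)       # hidden layer: z = sigma(w1 x + b1)
    return sigmoid(w2 @ z + b2)    # output layer: y_hat = sigma(w2 z + b2)

rng = np.random.default_rng(0)     # toy dimensions: 3 inputs, 4 hidden units, 2 outputs
x  = rng.normal(size=3)
w1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(x, w1, b1, w2, b2))  # two values in (0, 1)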

Loss or Cost function

$$ C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 $$

Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x.
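
As a short illustrative sketch (the toy arrays are assumptions), the quadratic cost can be computed directly from this definition:

import numpy as np

def quadratic_cost(targets, outputs):
    """C(w,b) = 1/(2n) * sum over the n training inputs of ||y(x) - a||^2."""
    n = len(targets)
    return sum(np.sum((y - a) ** 2) for y, a in zip(targets, outputs)) / (2 * n)

# toy example: n = 2 training inputs, 3-dimensional outputs
targets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
outputs = [np.array([0.8, 0.1, 0.1]), np.array([0.2, 0.7, 0.1])]
print(quadratic_cost(targets, outputs))   # 0.05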

The chain rule is used to calculate the derivative of the loss function with respect to the weights.

Note that, for simplicity, the partial derivatives below assume a 1-layer Neural Network.

Gradient Descent

  • Gradient descent: each update requires a pass over the entire training set to compute the loss, so it is slow.
  • Stochastic gradient descent: much faster, but its convergence is noisier and less stable.
  • Mini-batch gradient descent: splits the data into batches and updates the parameters once per batch; the samples in a batch jointly determine the gradient direction, which reduces the randomness. (A sketch contrasting the three update rules follows this list.)
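
A minimal sketch contrasting the three update rules; grad_loss(w, X, y) is a hypothetical gradient function and the batch size is illustrative:

import numpy as np

def full_batch_step(w, X, y, grad_loss, lr=0.1):
    # gradient descent: one pass over the whole dataset per update
    return w - lr * grad_loss(w, X, y)

def sgd_step(w, X, y, grad_loss, lr=0.1):
    # stochastic gradient descent: update from a single random sample
    i = np.random.randint(len(X))
    return w - lr * grad_loss(w, X[i:i+1], y[i:i+1])

def minibatch_step(w, X, y, grad_loss, lr=0.1, batch_size=32):
    # mini-batch gradient descent: update from a small random batch
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    return w - lr * grad_loss(w, X[idx], y[idx])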

Understanding the Mathematics behind Gradient Descent

A simple mathematical intuition behind one of the commonly used optimisation algorithms in Machine Learning.

https://www.douban.com/note/713353797/

Linear neural networks: A simple case

the output signal is created by

  • summing all the weighted inputs;
  • no activation function is applied.

The error is the difference between the target and the actual output: $e_i = t_i - o_i$

  • $e_1 = t_1 - o_1 = 1 - 0.92 = 0.08$

Depending on this error, we have to change the weights accordingly.

  • we can attribute a fraction of the error $e_1$ to $w_{11}$ in proportion to its weight (see the sketch below):
  • $e_1 \cdot \frac{w_{11}}{\sum_{i=1}^{4} w_{i1}} = 0.08 \cdot \frac{0.6}{0.6 + 0.4 + 0.1 + 0.2} = 0.037$
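
A small sketch of this proportional blame assignment (the numbers mirror the example above):

import numpy as np

w_col = np.array([0.6, 0.4, 0.1, 0.2])   # the weights w_11 ... w_41 feeding output node 1
e1 = 0.08                                # error at output node 1

blame = e1 * w_col / w_col.sum()         # each weight's share of the error, proportional to its size
print(blame[0])                          # ~0.037, the fraction attributed to w_11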

We know that $E = \sum_{j=1}^{n} \frac{1}{2} (t_j - o_j)^2$

Given that $t_j$ is a constant, we have:

$\frac{\partial E}{\partial o_{j}} = -(t_j - o_j)$

Apply the chain rule for the differentiation:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \cdot \frac{\partial o_j}{\partial w_{ij}}$

$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \frac{\partial o_j}{\partial w_{ij}} $

Further, we often use the sigmoid function as the activation function $\sigma(x) = \frac{1}{1+e^{-x}}$

Given $o_j = \sigma(\sum_{i=1}^{m} w_{ij}h_i)$, we have

$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \frac{\partial }{\partial w_{ij}} \sigma(\sum_{i=1}^{m} w_{ij}h_i)$

And it is easy to differentiate: $\frac{\partial \sigma(x)}{\partial x} = \sigma(x) \cdot (1 - \sigma(x))$

$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \sigma(\sum_{i=1}^{m} w_{ij}h_i) \cdot (1 - \sigma(\sum_{i=1}^{m} w_{ij}h_i)) \cdot \frac{\partial }{\partial w_{ij}} \sum_{i=1}^{m} w_{ij}h_i$

$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \sigma(\sum_{i=1}^{m} w_{ij}h_i) \cdot (1 - \sigma(\sum_{i=1}^{m} w_{ij}h_i)) \cdot h_i$
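
The derived expression can be checked numerically; a hedged sketch comparing the analytic gradient above with a finite-difference estimate for one weight (all values are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
h = rng.normal(size=4)                   # hidden activations h_i
w = rng.normal(size=4)                   # weights w_i1 into a single output node
t = 1.0                                  # target t_1

def E(w):
    o = sigmoid(w @ h)
    return 0.5 * (t - o) ** 2

o = sigmoid(w @ h)
grad_analytic = -(t - o) * o * (1 - o) * h   # the formula derived above

eps = 1e-6                               # finite-difference check on the first weight
w_plus = w.copy(); w_plus[0] += eps
print(grad_analytic[0], (E(w_plus) - E(w)) / eps)   # the two numbers should agree closely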

Handwritten Digit Recognition

https://github.com/lingfeiwu/people2vec

Each image has 8*8 = 64 pixels

  • input = 64 (one value per pixel)
  • batch size = 100
    • each batch contains 100 images
  • hidden neurons = 50
  • output = 10
    • one-hot encoded, e.g., [0, 0, 1, 0, ..., 0]
  • using the ReLU activation function

  • Set batch_size = 100 images
  • Given each image has 64 pixels
    • input_matrix = 100*64
  • Set #hidden neurons = 50
    • w1 = 64*50
    • hidden_matrix = 100*50
  • Given #output = 10
    • w2 = 50*10
    • output = 100*10 (these shapes are verified in the sketch below)
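
A quick shape check of the walkthrough above (random placeholders stand in for the data and weights):

import numpy as np

x  = np.random.rand(100, 64)       # one batch: 100 images, 64 pixels each
w1 = np.random.randn(64, 50)       # input -> hidden
w2 = np.random.randn(50, 10)       # hidden -> output

hidden = np.maximum(x.dot(w1), 0)  # ReLU
output = hidden.dot(w2)
print(hidden.shape, output.shape)  # (100, 50) (100, 10)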

In [139]:
# Author: Robert Guthrie
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

import sys
import matplotlib.cm as cm
import networkx as nx
import numpy as np
import pylab as plt
import matplotlib as mpl
from collections import defaultdict
from matplotlib.collections import LineCollection
%matplotlib inline

from sklearn import datasets
from sklearn.manifold import Isomap
from sklearn.manifold import TSNE
from sklearn.manifold import MDS
from sklearn.decomposition import PCA

1. Loading data


In [102]:
#basic functions
# softmax
def softmax(x):
    e_x = np.exp(x - np.max(x)) # to avoid inf
    return e_x / e_x.sum(axis=0)

def softmaxByRow(x):
    e_x = np.exp(x - x.max(axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)

# flush print
def flushPrint(d):
    sys.stdout.write('\r')
    sys.stdout.write(str(d))
    sys.stdout.flush()
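
A quick usage example (toy numbers): applying softmaxByRow to a small array turns each row into a probability distribution.

a = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 1.0]])
print(softmaxByRow(a))              # each row is non-negative and ordered like the inputs
print(softmaxByRow(a).sum(axis=1))  # [1. 1.]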

In [103]:
# the limits of np.exp
np.exp(1000)


Out[103]:
inf

In [104]:
# load data
digits = datasets.load_digits()
# display data
fig, ax = plt.subplots(5, 5, figsize=(5, 5))
for i, axi in enumerate(ax.flat):
    axi.imshow(digits.images[i], cmap='binary')
    axi.set(xticks=[], yticks=[])


2. Neural Network (using numpy)


In [105]:
# prepare training sets
N,  H, D_in, D_out = 100, 50, 64, 10 # batch size, hidden, input, output dimension
k = 0.9 # the fraction of training data
learning_rate = 1e-6
L = len(digits.data)
l = int(L*k)

In [106]:
L, l


Out[106]:
(1797, 1617)

In [107]:
Batches = {}
M = 200 # number of batches
for j in range(M):
    index=list(np.random.randint(l, size=N)) # randomly sample N data points
    y = np.zeros((N, 10))
    y[np.arange(N), list(digits.target[index])] = 1
    x=digits.data[index]
    Batches[j]=[x,y]

In [108]:
j = 7
x, y = Batches[j]

In [109]:
plt.imshow(x, cmap = 'binary') # 100*64
plt.show()



In [110]:
plt.imshow(y, cmap = 'binary') # 100*10
plt.show()


3. Randomly initializing weights


In [111]:
w1 = np.random.randn(D_in, H)/H 
w2 = np.random.randn(H, D_out)/H
w1c = w1.copy() # keep a copy for comparison in the visualization
w2c = w2.copy()

In [112]:
plt.imshow(w1, cmap = 'binary') # 64*50
plt.title('w1', fontsize = 20)
plt.show()



In [113]:
plt.imshow(x.dot(w1), cmap = 'binary') # 100*50
plt.title('h', fontsize = 20)
plt.show()


4. Forward pass

ReLU activation function: $relu(x) = \max(0, x)$

$$softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

In [114]:
h = x.dot(w1)
# relu activation
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
plt.imshow(y_pred, cmap = 'binary') # 100*10
plt.title('predicted_relu', fontsize = 20)
plt.show()



In [115]:
# softmax
y_predS=softmaxByRow(y_pred)
plt.imshow(y_predS, cmap = 'binary') # 100*10
plt.title('predicted_softmax', fontsize = 20)
plt.show()



In [116]:
plt.plot(y_pred[0], 'r-o')
plt.plot(y_predS[0], 'g-s')
plt.show()


5. Loss & Backpropagation


In [119]:
Loss=defaultdict(lambda:[])

loss = np.square(y_predS - y).sum()
Loss[j].append([t,loss]) # t is the iteration counter; here it is left over from an earlier run of the training loop

In [122]:
Loss.items()


Out[122]:
dict_items([(7, [[499, 90.08488189002807]])])

Our goal in training is to find the set of weights and biases that minimizes the loss function.

  • To know by how much to adjust the weights and biases, we need the derivative of the loss function with respect to the weights and biases.


In [118]:
# Backprop
grad_y_pred = 2.0 * (y_predS - y)    # derivative of the squared error w.r.t. the prediction
grad_w2 = h_relu.T.dot(grad_y_pred)  # gradient w.r.t. w2
grad_h_relu = grad_y_pred.dot(w2.T)  # propagate the error back through w2
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0                    # ReLU derivative: zero where the pre-activation was negative
grad_w1 = x.T.dot(grad_h)            # gradient w.r.t. w1
# (note: the softmax Jacobian is not included; the error is propagated as if y_pred were the prediction)
# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
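
A small sanity check (not in the original notebook): the gradients have the same shapes as the weights they update, and with such a tiny learning rate the loss recomputed after the update barely moves.

print(grad_w1.shape, w1.shape)   # (64, 50) (64, 50)
print(grad_w2.shape, w2.shape)   # (50, 10) (50, 10)

h_new = np.maximum(x.dot(w1), 0)
loss_new = np.square(softmaxByRow(h_new.dot(w2)) - y).sum()
print(loss, loss_new)            # usually a small decrease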

In [130]:
fig = plt.figure(figsize=(8, 3))
ax=fig.add_subplot(121)
plt.imshow(w1, cmap = 'binary') # 64*50
plt.title('w1_updated', fontsize = 20)
ax=fig.add_subplot(122)
plt.imshow(w2, cmap = 'binary') # 50*10
plt.title('w2_updated', fontsize = 20)
plt.tight_layout()


6. Putting It Together


In [131]:
w1 = np.random.randn(D_in, H)/H 
w2 = np.random.randn(H, D_out)/H
w1c = w1.copy() # keep a copy for comparison in the visualization
w2c = w2.copy()
Loss=defaultdict(lambda:[])
# training
for j in Batches:
    flushPrint(j)
    x,y=Batches[j]
    for t in range(500):# repeated use of the same batch
        # Forward
        h = x.dot(w1)
        h_relu = np.maximum(h, 0)
        y_pred = h_relu.dot(w2)
        y_predS=softmaxByRow(y_pred)
        # loss
        loss = np.square(y_predS - y).sum()
        Loss[j].append([t,loss])
        # Backprop 
        grad_y_pred = 2.0 * (y_predS - y)
        grad_w2 = h_relu.T.dot(grad_y_pred)
        grad_h_relu = grad_y_pred.dot(w2.T)
        grad_h = grad_h_relu.copy()
        grad_h[h < 0] = 0
        grad_w1 = x.T.dot(grad_h)
        # Update weights
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2


199

In [132]:
fig = plt.figure(figsize=(8, 3))
ax=fig.add_subplot(121)
plt.imshow(w1, cmap = 'binary') # 64*50
plt.title('w1_updated', fontsize = 20)
ax=fig.add_subplot(122)
plt.imshow(w2, cmap = 'binary') # 50*10
plt.title('w2_updated', fontsize = 20)
plt.tight_layout()



In [133]:
# Display the decreasing loss
fig = plt.figure(figsize=(5, 4))
cmap = cm.get_cmap('rainbow',M)
for i in Loss:
    epochs,loss=zip(*sorted(Loss[i]))
    plt.plot(epochs,loss,color=cmap(i),alpha=0.7)
plt.xlabel('Epochs',fontsize=18)
plt.ylabel('Loss',fontsize=18)
ax1 = fig.add_axes([0.2, 0.8, 0.65, 0.03])
cb1 = mpl.colorbar.ColorbarBase(ax1, cmap=cmap,
                                norm=mpl.colors.Normalize(vmin=0, vmax=M),
                                orientation='horizontal')
cb1.set_label('N of batches')



In [134]:
# Test
TestData=digits.data[-(L-l):]
PredictData=np.maximum(TestData.dot(w1),0).dot(w2)
compare=np.argmax(PredictData,axis=1)-digits.target[-(L-l):]
Accuracy=list(compare).count(0)/float(len(compare))
Accuracy


Out[134]:
0.9333333333333333
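
To see which digits get confused with which, a per-class breakdown can be added; a brief sketch using sklearn's confusion_matrix (an extra import beyond the notebook's):

from sklearn.metrics import confusion_matrix

y_true = digits.target[-(L-l):]
y_hat = np.argmax(PredictData, axis=1)
cm = confusion_matrix(y_true, y_hat)
plt.imshow(cm, cmap='Blues')
plt.xlabel('predicted digit')
plt.ylabel('true digit')
plt.colorbar()
plt.show()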

7. Visualization

7.1 Matrices as manifold translators

before and after training


In [137]:
cmap=plt.cm.get_cmap('Accent', 10)
fig = plt.figure(figsize=(12, 4))
fig.add_subplot(141)
plt.imshow(w1c,cmap='Blues')
plt.title('w1 before training')
fig.add_subplot(142)
plt.imshow(w1,cmap='Blues')
plt.title('w1 after training')
fig.add_subplot(143)
plt.imshow(w2c,cmap='Blues')
plt.title('w2 before training')
fig.add_subplot(144)
plt.imshow(w2,cmap='Blues')
plt.title('w2 after training')
plt.tight_layout()



In [140]:
# dimension reduction for viz
pca = PCA(n_components=2)
projectionPixel = pca.fit_transform(w1) # 64*2
projectionLabel = pca.fit_transform(w2.T) # 10*2

In [145]:
G=nx.grid_2d_graph(8,8)
pos=dict(((i,j),(i,j)) for i,j in G.nodes())
index=sorted(pos.keys())
posPixel=dict(zip(index, projectionPixel))


In [179]:
# For two 3s (images 3 and 13) and two 4s (images 4 and 14): the top row shows the raw 8x8 image;
# the bottom row re-plots the same pixel intensities at the 2-D PCA coordinates of the corresponding rows of w1
fig = plt.figure(figsize=(16, 8))
#
ax=fig.add_subplot(241)
data3_1=digits.images[3]
x,y,z=zip(*[(i,j,data3_1[i,j]) for i,j in index])
#nx.draw_networkx_edges(G,pos=pos,color='gray',alpha=0.5)
plt.scatter(y,x,s=100,c=z,cmap='binary')
plt.imshow(data3_1,cmap='Blues')
#
ax=fig.add_subplot(245)
line_segments = LineCollection([[posPixel[i],posPixel[j]] for i,j in G.edges()],\
                               color='gray',zorder=1)
ax.add_collection(line_segments)
x,y,z=zip(*[(projectionPixel[n][0],projectionPixel[n][1],data3_1[xy[0],xy[1]]) \
            for n,xy in enumerate(index)])
plt.scatter(x,y,s=100,c=z,cmap='binary',zorder=2)
#
ax=fig.add_subplot(242)
data3_2=digits.images[13]
x,y,z=zip(*[(i,j,data3_2[i,j]) for i,j in index])
#nx.draw_networkx_edges(G,pos=pos,color='gray',alpha=0.5)
plt.scatter(y,x,s=100,c=z,cmap='binary')
plt.imshow(data3_2,cmap='Blues')
#
ax=fig.add_subplot(246)
line_segments = LineCollection([[posPixel[i],posPixel[j]] for i,j in G.edges()],\
                               color='gray',zorder=1)
ax.add_collection(line_segments)
x,y,z=zip(*[(projectionPixel[n][0],projectionPixel[n][1],data3_2[xy[0],xy[1]]) \
            for n,xy in enumerate(index)])
plt.scatter(x,y,s=100,c=z,cmap='binary',zorder=2)
#
ax=fig.add_subplot(243)
data4_1=digits.images[4]
x,y,z=zip(*[(i,j,data4_1[i,j]) for i,j in index])
#nx.draw_networkx_edges(G,pos=pos,color='gray',alpha=0.5)
plt.scatter(y,x,s=100,c=z,cmap='binary')
plt.imshow(data4_1,cmap='Blues')
#
#
ax=fig.add_subplot(247)
line_segments = LineCollection([[posPixel[i],posPixel[j]] for i,j in G.edges()],\
                               color='gray',zorder=1)
ax.add_collection(line_segments)
x,y,z=zip(*[(projectionPixel[n][0],projectionPixel[n][1],data4_1[xy[0],xy[1]]) \
            for n,xy in enumerate(index)])
plt.scatter(x,y,s=100,c=z,cmap='binary',zorder=2)
#
ax=fig.add_subplot(244)
data4_2=digits.images[14]
x,y,z=zip(*[(i,j,data4_2[i,j]) for i,j in index])
#nx.draw_networkx_edges(G,pos=pos,color='gray',alpha=0.5)
plt.scatter(y,x,s=100,c=z,cmap='binary')
plt.imshow(data4_2,cmap='Blues')
#
#
ax=fig.add_subplot(248)
line_segments = LineCollection([[posPixel[i],posPixel[j]] for i,j in G.edges()],\
                               color='gray',zorder=1)
ax.add_collection(line_segments)
x,y,z=zip(*[(projectionPixel[n][0],projectionPixel[n][1],data4_2[xy[0],xy[1]]) \
            for n,xy in enumerate(index)])
plt.scatter(x,y,s=100,c=z,cmap='binary',zorder=2)
#
plt.tight_layout()
plt.show()


7.2 Alternative manifold learning methods


In [113]:
# dimension reduction for viz
pca = PCA(n_components=2)
iso = Isomap(n_components=2)
tsne = TSNE(n_components=2)
mds = MDS(n_components=2)
#
encodeData = digits.data.dot(w1)
projection0 = pca.fit_transform(digits.data)
projection1 = pca.fit_transform(encodeData)
projection2 = mds.fit_transform(digits.data)
projection3 = iso.fit_transform(digits.data)
projection4 = tsne.fit_transform(digits.data)
#
targ = np.zeros((len(digits.target), 10))
targ[np.arange(len(digits.target)), list(digits.target)] = 1
encodeTarget = w2.dot(targ.T).T
projection11 = pca.fit_transform(encodeTarget)

In [320]:
# viz
cmap=plt.cm.get_cmap('Accent', 10)
fig = plt.figure(figsize=(12, 8))
#
def viz(projection,ax,title):
    plt.scatter(projection[:, 0], projection[:, 1],lw=0,s=6,c=digits.target, cmap=cmap)
    #plt.colorbar(ticks=range(10), label='digit value')
    plt.title(title)
    ax.set_facecolor('black')  # set_axis_bgcolor was removed in newer matplotlib
#
ax=fig.add_subplot(233)
xs,ys=np.round(projection11,3).T
for x,y,i in set(zip(xs, ys,digits.target)):
    plt.scatter(x,y,alpha=0)
    plt.text(x,y,str(i),color=cmap(i),size=18)
ax.set_facecolor('black')  # set_axis_bgcolor was removed in newer matplotlib
plt.title('PCA encode label (50D)')
#
viz(projection0,fig.add_subplot(231),'PCA raw data (64D)')
viz(projection1,fig.add_subplot(232),'PCA encode data (50D)')
#viz(projection11,fig.add_subplot(233),'PCA encode label (50D)')
viz(projection2,fig.add_subplot(234),'MDS raw data (64D)')
viz(projection3,fig.add_subplot(235),'Isomap raw data (64D)')
viz(projection4,fig.add_subplot(236),'TSNE raw data (64D)')
#
plt.tight_layout()


8. Neural Network (using PyTorch)


In [84]:
# initialize (Variable is a legacy API; in PyTorch >= 0.4 a plain tensor with requires_grad=True works the same)
dtype = torch.FloatTensor
w1 = Variable(torch.randn(D_in, H).type(dtype)/H, requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype)/H, requires_grad=True)
learning_rate = 1e-6
Loss=defaultdict(lambda:[])

In [94]:
# train
for j in Batches:
    flushPrint(j)
    x,y=Batches[j]
    x = Variable(torch.from_numpy(x).type(dtype), requires_grad=False)  
    y = Variable(torch.from_numpy(y).type(dtype), requires_grad=False)
    for t in range(500):
        y_pred = x.mm(w1).clamp(min=0).mm(w2)
        softmax = nn.Softmax(dim=1)
        y_soft=softmax(y_pred)
        loss = (y_soft - y).pow(2).sum()
        Loss[j].append([t,loss.data.item()])
        loss.backward()
        w1.data -= learning_rate * w1.grad.data
        w2.data -= learning_rate * w2.grad.data
        w1.grad.data.zero_()
        w2.grad.data.zero_()


199
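
For comparison, a hedged sketch of the same training written with more recent PyTorch idioms (nn.Sequential plus torch.optim); this is an alternative formulation, not the notebook's cell, and the weight initialization differs from the hand-rolled version:

model = nn.Sequential(nn.Linear(D_in, H, bias=False),
                      nn.ReLU(),
                      nn.Linear(H, D_out, bias=False),
                      nn.Softmax(dim=1))
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
loss_fn = nn.MSELoss(reduction='sum')   # sum of squared errors, matching (y_soft - y).pow(2).sum()

for j in Batches:
    x, y = Batches[j]
    x = torch.from_numpy(x).type(dtype)
    y = torch.from_numpy(y).type(dtype)
    for t in range(500):
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()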

In [96]:
# Display the decreasing loss
fig = plt.figure(figsize=(5, 4))
cmap = cm.get_cmap('rainbow',M)
for i in Loss:
    epochs,loss=zip(*sorted(Loss[i]))
    plt.plot(epochs,loss,color=cmap(i),alpha=0.7)
plt.xlabel('Epochs',fontsize=18)
plt.ylabel('Loss',fontsize=18)
ax1 = fig.add_axes([0.2, 0.8, 0.65, 0.03])
cb1 = mpl.colorbar.ColorbarBase(ax1, cmap=cmap,
                                norm=mpl.colors.Normalize(vmin=0, vmax=M),
                                orientation='horizontal')
cb1.set_label('N of batches')



In [97]:
TestData=digits.data[-(L-l):]
xTest = Variable(torch.from_numpy(TestData).type(dtype), requires_grad=False)
PredictData = xTest.mm(w1).clamp(min=0).mm(w2)
compare=np.argmax(PredictData.data.numpy(),axis=1)-digits.target[-(L-l):]
Accuracy=list(compare).count(0)/float(len(compare))
Accuracy


Out[97]:
0.9277777777777778