Let's first prepare the data from the CIFAR-10 dataset.
Source: http://www.cs.utoronto.ca/~kriz/cifar.html
The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.
(we will be using batch_1 only out of 6 batches -- so 10,000 images in total)
The classes are ["airplane","automobile","bird","cat","deer","dog","frog","horse","ship","truck"]
no. of Classes | 10 |
---|---|
Samples per class | ~1,000 |
Samples total | 10,000 |
Dimensionality | 32x32x3 = 3,072 |
Features | integer 0-255 per pixel channel |
In [1]:
import pandas as pd
import numpy as np
from random import randint
from pandas import Series,DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import math
import time
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
In [2]:
def unpickle(file):
import cPickle
fo = open(file, 'rb')
dict = cPickle.load(fo)
fo.close()
return dict
In [4]:
# source: http://www.cs.utoronto.ca/~kriz/cifar.html
# get the training data, batch 1 out of 6. do unzip the tar.gz first
from time import strftime
import os
if os.name == 'nt': # Windows
path = 'cifar\\data_batch_1'
else:
path = 'cifar/data_batch_1'
cifar = unpickle(path)
In [5]:
def array_to_image(arr):
im = arr.reshape((32,32,3),order='F')/255.0
im = np.transpose(im,axes=(1,0,2))
return im
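As a quick sanity check (my own addition, not part of the original pipeline), the Fortran-order reshape above should give the same image as the more common C-order reshape into (3,32,32) followed by a transpose:
In [ ]:
# optional check: both unflattening recipes should produce the same 32x32x3 image
arr = cifar["data"][0]
im1 = array_to_image(arr)
im2 = np.transpose(arr.reshape((3,32,32),order='C'), axes=(1,2,0))/255.0
print(np.allclose(im1, im2))   # expected: True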
In [6]:
# image data is uint8 array format
# 3 RGB channels and 32x32 pixels means 32x32x3 = 1024x3 = 3072 numbers in one array
# 10,000 arrays in dataset
lucky_num = randint(0,9999) #randint is inclusive on both ends; valid indices are 0-9999
sample_arr = cifar["data"][lucky_num].copy()
print 'array_size = {}'.format(len(sample_arr))
print sample_arr
print 'file_name = {}'.format(cifar["filenames"][lucky_num])
#show image
plt.rcParams['figure.figsize'] = (4, 4)
plt.imshow(array_to_image(sample_arr))
plt.show()
Source1: http://cs231n.github.io/convolutional-networks/
Source2: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html
Note: most of the knowledge below was gleaned from the Stanford CS231n course (Source1); its figures and explanations are excellent learning material.
A Convolutional Neural Network (CNN) is essentially a normal neural net with a "convolution" stage that transforms the raw input into meaningful features. It is most commonly applied to images, video, and other "dense" input problems.
The mathematical definition is: "the integral of the product of two functions after one is reversed and shifted."
$$ (f * g)(t) = \int_{-\infty}^{\infty} f(x) \cdot g(t - x) \; dx $$
This image from Wikipedia is much easier to understand.
In matrix form, below is how to convolve an input matrix with another matrix (called a kernel).
The convolution is essentially an element-wise multiplication and then sum of the result.
Note: the kernel needs to be flipped (in every dimension) before doing the multiplication.
Here we have a 5x5 (green) input matrix and a 3x3 (orange) kernel, using a stride of 1 (moving the kernel 1 cell at a time).
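To make the "flip, slide, multiply, and sum" recipe concrete, here is a small sketch with toy numbers of my own choosing: it convolves a 5x5 input with a 3x3 kernel by hand and compares the result against scipy.signal.convolve2d.
In [ ]:
import scipy.signal as sig
# toy 5x5 input and a deliberately non-symmetric 3x3 kernel
x = np.arange(25.).reshape(5,5)
k = np.arange(1.,10.).reshape(3,3)
# manual convolution: flip the kernel in both dimensions, then slide it with
# stride 1 (no padding), multiplying element-wise and summing at each position
k_flip = k[::-1,::-1]
out = np.zeros((3,3))
for i in range(3):
    for j in range(3):
        out[i,j] = np.sum(x[i:i+3, j:j+3]*k_flip)
print(np.allclose(out, sig.convolve2d(x, k, mode='valid')))   # expected: True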
It is even more useful to look at how convolution is applied in real applications.
The most (visually) obvious use is as a filter in a photo editor -- edge detection, blur, etc. The kernel acts mostly as a mask (cell values such as 1 or 0) to enhance or hide features.
Now that we know (roughly) what convolution is, let's move on to the CNN architecture.
We're going to use the structure from here. The source website contains great explanations of many concepts, so I will only give a short summary for our purpose.
Note: to make debugging of numpy array printouts easier, I will use the notation [Depth x Height x Width].
Let's get into details of each layer.
Not much to be said here. We will use an image matrix of [RGB, Height, Width] = [3,32,32].
First, we will use a filter (aka. kernel) matrix of RGBx3x3.
In terms of neurons, the number of weights (connections) for each filter is 3x3x3 + 1 bias = 28.
Since we will have 12 filters (the output "depth"), we will have 28x12 = 336 parameters in total for the CONV layer.
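A quick arithmetic check of that parameter count (a throwaway snippet, using the same filter shape as the kernel_size tuple defined later):
In [ ]:
n_filters, n_channels, k_h, k_w = 12, 3, 3, 3
print(n_filters*(n_channels*k_h*k_w + 1))   # 12 filters x (27 weights + 1 bias) = 336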
The formula below calculates the shape of the output matrix.
For input Width (or height) = 32, Filter width = 3, zero-Padding = 1, and Stride = 1 :
$$ \begin{align} {Output}& = \dfrac{(W + 2 \cdot P - F)}{S} + 1 \\ {Output}& = \dfrac{(32 + 2 \cdot 1 - 3)}{1} + 1 = 32 \end{align} $$
So the output matrix from this layer is [12x32x32].
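Here is a minimal helper for the output-size formula above (my own convenience function; the training code later just hard-codes the resulting shapes):
In [ ]:
def conv_output_size(W, F, P, S):
    """Output width (or height) for input size W, filter size F, padding P, stride S."""
    return (W + 2*P - F)//S + 1
print(conv_output_size(32, 3, 1, 1))   # CONV layer: expected 32
print(conv_output_size(32, 2, 0, 2))   # 2x2 max pool with stride 2: expected 16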
We will apply a Rectified Linear Unit (ReLU) function here. The output has the same dimensions, [12x32x32].
$$ f(x) = max(0,x) $$
Or, put another way:
$$ f(x) = \begin{cases} x & \text{if } x \gt 0 \\ 0 & \text{Otherwise} \end{cases} $$
Downsampling from [12x32x32] to [12x16x16].
We are going to take the max over each 2x2 block inside each 32x32 slice, so the spatial size is halved -- giving [12x16x16].
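As a tiny illustration (toy [1x4x4] input of my own choosing), this is the same reshape trick that the max_pool helper defined later uses: each output cell is the max of a non-overlapping 2x2 block.
In [ ]:
t = np.arange(16.).reshape(1,4,4)
d,h,w = t.shape
pooled = np.amax(t.reshape(d,h//2,2,w//2,2).swapaxes(2,3).reshape(d,h//2,w//2,4), axis=3)
print(pooled)   # expected: [[[ 5. 7.] [13. 15.]]]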
This is a normal NN output layer, connecting the full [12x16x16] input to the [1x1x10] predicted output.
We are going to use the softmax function so the outputs form a valid probability distribution.
$$ z_i = w_{soft} \cdot f_{POOL,i} $$
$$ \sigma_{soft}(z_i) = \cfrac{e^{z_{i}}}{ \sum_{j=0}^m e^{z_{j}} } $$
However, there is a chance that $ e^{z} $ becomes very large, which could make the computer croak by trying to divide two large numbers.
Therefore, we can add a constant $C$ to every exponent (in both the numerator and the denominator), so the magnitudes stay reasonable.
$$ \sigma_{soft}(z_i) = \cfrac{e^{z_i + C}}{ \sum_{j=0}^m e^{z_j + C} } $$
$$ C = -\max z $$
Based on the softmax classifier, the (cross-entropy) loss function is as follows:
$$ L_i = - y_i \cdot \log( \; \sigma_{soft, i} \;) $$
$$ L_i = - y_i \cdot \log( \; \frac {e^{z_{i}}}{\sum e^z} \;) $$
$$ L_i = - ( \; z_{y,i} - \log \sum e^z \;) $$
$$ L = \frac{1}{N} \sum L_i $$
This roughly means that for each instance we only use the predicted output ($ \sigma_{soft} $) of the correct class for that particular instance; the one-hot target zeroes out the rest.
For the total loss over the whole dataset, we average the per-instance losses $L_i$.
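A small sketch with toy scores of my own choosing, showing why the shift by $ C = -\max z $ matters and how the loss above is computed (the loss_func defined later does the same thing):
In [ ]:
z = np.array([1000.0, 1001.0, 1002.0])   # large scores: the naive softmax overflows
y = np.array([0.0, 0.0, 1.0])            # one-hot target (class 2 is correct)
print(np.exp(z)/np.sum(np.exp(z)))       # naive: overflow warning, prints nan
z_shift = z - np.max(z)                  # C = -max(z)
sigma = np.exp(z_shift)/np.sum(np.exp(z_shift))
print(sigma)                             # stable: roughly [0.09 0.245 0.665]
print(-(np.sum(y*z_shift) - np.log(np.sum(np.exp(z_shift)))))   # L_i = -log(sigma[2]) ~ 0.41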
This describes the flow of data.
INPUT --> CONV --> RELU --> POOL --> OUT (Softmax)
The equations and derivations are from (this slide). Doing backprop in stages using the chain rule is very useful.
For the OUT layer, the softmax backprop is as follows.
$$ \begin{align} \frac{\partial L}{\partial \sigma_{soft}} &= - \frac{y}{\sigma_{soft} } \\[5pt] \frac{\partial L}{\partial z_i} &= - \frac{y_i}{\sigma_{soft,i} } \cdot \frac{\partial \sigma_{soft, i}}{\partial z_i} + \sum_{j \neq i} - \frac{y_j}{\sigma_{soft,j}} \cdot \frac{\partial \sigma_{soft, j}}{\partial z_i} \\[5pt] \frac{\partial \sigma_{soft, i}}{\partial z_i} &= \sigma_{soft,i} \cdot (1 - \sigma_{soft, i}) \quad \text{, where } \sigma_{soft} \text{ is for } i \\[5pt] \frac{\partial \sigma_{soft, j}}{\partial z_i} &= - \sigma_{soft,j} \cdot \sigma_{soft, i} \qquad \text{, where } \sigma_{soft} \text{ is NOT for } i \\[5pt] \frac{\partial L}{\partial z_i} &= \sigma_{soft, i} - y_i \\[5pt] \frac{\partial L}{\partial w_{soft}} &= (\sigma_{soft, i} - y_i) \cdot f_{POOL,i} \quad \text{, for the weight update}\\[5pt] \frac{\partial L}{\partial f_{POOL,i}} &= (\sigma_{soft, i} - y_i) \cdot w_{soft} \qquad \text{, for the pass-through to the lower layer} \end{align} $$
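To double-check the neat result $ \partial L/\partial z_i = \sigma_{soft,i} - y_i $, here is a quick numerical sketch of my own (central finite differences against the analytic gradient):
In [ ]:
def ce_loss(z, y):
    z = z - np.max(z)
    return -(np.sum(y*z) - np.log(np.sum(np.exp(z))))
z = np.array([0.5, -1.2, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])
sigma = np.exp(z - np.max(z))/np.sum(np.exp(z - np.max(z)))
analytic = sigma - y
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (ce_loss(z + dz, y) - ce_loss(z - dz, y))/(2*eps)
print(np.allclose(analytic, numeric))   # expected: True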
For the POOL layer, we can pass the gradient directly back to the input that was the "max"; otherwise it is zero.
$$ \frac{\partial L}{\partial f_{RELU,i}} = \begin{cases} \frac{\partial L}{\partial f_{POOL,i}} & \text{if } f_{RELU,i} \text{ is } \max \\ 0 & \text{Otherwise} \end{cases} $$
For the RELU layer, again we can pass the gradient through directly if $ \; f_{CONV} \gt 0 $.
$$ \frac{\partial L}{\partial f_{CONV,i}} = \begin{cases} \frac{\partial L}{\partial f_{RELU,i}} & \text{if } f_{CONV,i} \gt 0 \\ 0 & \text{Otherwise} \end{cases} $$
For the CONV layer, there are a few steps.
First, let's look at an example of convolving a [2x2] kernel $w_{r,c}$ with a [3x3] input $x_{r,c}$. The result is a [2x2] matrix $f_{r,c}$.
Note: $r$ = row index and $c$ = column index, both starting at 0, like so:
$w_{0,0}$ | $w_{0,1}$ |
---|---|
$w_{1,0}$ | $w_{1,1}$ |
So for the forward pass with stride = 1, we have all the $f_{r,c}$ as follows:
$$ \begin{align} f_{0,0} &= w_{0,0} \cdot x_{0,0} + w_{0,1} \cdot x_{0,1} + w_{1,0} \cdot x_{1,0} + w_{1,1} \cdot x_{1,1} + \text{bias}\\ f_{0,1} &= w_{0,0} \cdot x_{0,1} + w_{0,1} \cdot x_{0,2} + w_{1,0} \cdot x_{1,1} + w_{1,1} \cdot x_{1,2} + \text{bias}\\ f_{1,0} &= w_{0,0} \cdot x_{1,0} + w_{0,1} \cdot x_{1,1} + w_{1,0} \cdot x_{2,0} + w_{1,1} \cdot x_{2,1} + \text{bias}\\ f_{1,1} &= w_{0,0} \cdot x_{1,1} + w_{0,1} \cdot x_{1,2} + w_{1,0} \cdot x_{2,1} + w_{1,1} \cdot x_{2,2} + \text{bias} \end{align} $$
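The explicit formulas above can be checked directly with toy numbers of my own choosing (stride 1, no padding), using the same un-flipped indexing convention $ f_{i,j} = \sum_{r,c} w_{r,c} \cdot x_{i+r,j+c} + \text{bias} $:
In [ ]:
x = np.arange(9.).reshape(3,3)        # toy 3x3 input
w = np.array([[1.,2.],[3.,4.]])       # toy 2x2 kernel
bias = 0.5
f = np.zeros((2,2))
for i in range(2):
    for j in range(2):
        f[i,j] = np.sum(w*x[i:i+2, j:j+2]) + bias
print(f)
# spot-check the first formula: f_{0,0} = w00*x00 + w01*x01 + w10*x10 + w11*x11 + bias
print(w[0,0]*x[0,0] + w[0,1]*x[0,1] + w[1,0]*x[1,0] + w[1,1]*x[1,1] + bias)   # should equal f[0,0]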
Then we can calculate the gradients -- the partial derivatives with respect to the kernel weights $w$. Here, $f$ means $f_{CONV}$, the output of the CONV layer:
$$ \begin{align} \frac{\partial f_{0,0}}{\partial w_{0,0}} &= x_{0,0} \\[5pt] \frac{\partial f_{i,j}}{\partial w_{r,c}} &= x_{(i+r),(j+c)} \\[5pt] \frac{\partial L}{\partial w_{r,c}} &= \sum_{i=0,j=0}^{n,m} \frac{\partial L}{\partial f_{i,j}} \cdot \frac{\partial f_{i,j}}{\partial w_{r,c}} \\[5pt] \frac{\partial L}{\partial w_{r,c}} &= \sum_{i=0,j=0}^{n,m} \frac{\partial L}{\partial f_{i,j}} \cdot x_{(i+r),(j+c)} \end{align} $$
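Here is a finite-difference sketch of my own to verify that gradient formula: with an arbitrary upstream gradient $ \partial L/\partial f $, the analytic $ \partial L/\partial w_{r,c} $ should match a numerical derivative of $ L = \sum_{i,j} (\partial L/\partial f_{i,j}) \cdot f_{i,j} $.
In [ ]:
def conv2x2(x, w):
    # forward pass with the indexing convention above (stride 1, no padding)
    f = np.zeros((2,2))
    for i in range(2):
        for j in range(2):
            f[i,j] = np.sum(w*x[i:i+2, j:j+2])
    return f
x = np.random.randn(3,3)
w = np.random.randn(2,2)
g = np.random.randn(2,2)              # arbitrary upstream gradient dL/df
# analytic: dL/dw[r,c] = sum_{i,j} g[i,j]*x[i+r, j+c]
analytic = np.zeros((2,2))
for r in range(2):
    for c in range(2):
        analytic[r,c] = np.sum(g*x[r:r+2, c:c+2])
# numerical: central differences of L = sum(g*conv2x2(x,w)) w.r.t. each weight
eps = 1e-6
numeric = np.zeros((2,2))
for r in range(2):
    for c in range(2):
        dw = np.zeros((2,2))
        dw[r,c] = eps
        numeric[r,c] = (np.sum(g*conv2x2(x, w+dw)) - np.sum(g*conv2x2(x, w-dw)))/(2*eps)
print(np.allclose(analytic, numeric))   # expected: True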
For the bias, the partial derivative is 1, so we can just update it with the sum of the gradients:
$$ \frac{\partial L}{\partial \text{bias}} = \sum_{i=0,j=0}^{n,m} \frac{\partial L}{\partial f_{i,j}} \cdot 1 $$
We are going to use mini-batch stochastic gradient descent (SGD).
The gradient update will use momentum, and the learning rate will decay with the epoch.
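As a minimal sketch (with illustrative names only; the real update happens inline in the training loop below), one mini-batch update has this shape:
In [ ]:
def sgd_momentum_step(w, grad, prev_update, lr, decay, momentum, batch_size):
    # scaled negative gradient step, averaged over the mini-batch
    update = -lr*decay*grad/batch_size
    # momentum: also re-apply a fraction of the previous update
    w_new = w + update + momentum*prev_update
    return w_new, update   # the returned update becomes prev_update on the next call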
Since this code is intended to be clear for educational purposes, its performance is far from production level.
I'm trying to use "just" NumPy arrays, and only bring in other helper libraries as necessary.
This ends the theory part. Let's start coding!
...
..
.
In [7]:
# X Format is [image_index, RGB, Height, Width]
X = cifar["data"].copy()
X = X.reshape((-1,3,32,32),order='C')/255.0 - 0.5 #scale to [0,1], then center around 0
#set size of input, features, hidden, target
instance_size = X.shape[0]
feature_size = X.shape[1]*X.shape[2]*X.shape[3]
target_size = 10
kernel_size = (12,3,3,3)
weight_size = (10,12,16,16) #softmax layer
#make a flat 10 output with all zeros
Y = np.zeros((instance_size,10))
for j in range(0,instance_size):
Y[j][cifar['labels'][j]] = 1
#split train and test dataset
train_split = 0.8 #6000.0/instance_size #0.8
train_size = int(train_split*instance_size)
test_size = instance_size - train_size
index = np.random.permutation(instance_size)
train_ix, test_ix = index[:train_size], index[train_size:train_size+test_size]
Y_train , Y_test = Y[train_ix,:], Y[test_ix,:]
X_train , X_test = X[train_ix,:,:,:], X[test_ix,:,:,:]
#Xb = np.insert(X_train,0,1,axis=1) #add bias input, always activated
#Xb_test = np.insert(X_test,0,1,axis=1) #add bias input, always activated
In [30]:
def convolve_3d_slow(data,kernel,bias_kernel,stride,pad):
#73.8 ms per call --> about 10 min per 8,000 images
if pad > 0:
tpad = ((0,0),(pad,pad),(pad,pad))
data = np.pad(data, pad_width=tpad, mode='constant', constant_values=0)
k_h = kernel.shape[2]
k_w = kernel.shape[3]
len_h = (data.shape[1] - k_h)/stride + 1
len_w = (data.shape[2] - k_w)/stride + 1
len_d = kernel.shape[0]
output = np.zeros((len_d,len_h,len_w))
kernel_rot = kernel[:,::-1,::-1,::-1]
for h in range(0,len_h*stride,stride):
for w in range(0,len_w*stride,stride):
for d in range(0,len_d,1):
output[d,h/stride,w/stride] = np.sum(data[:,h:(h+k_h),w:(w+k_w)]*kernel_rot[d,:,:,:]) + bias_kernel[d]
return output
import scipy.ndimage as ndi
import scipy.signal as sig
def convolve_3d_fast(data,kernel,bias_kernel,stride,pad):
#use scipy convolve then sum up
if pad > 0:
tpad = ((0,0),(pad,pad),(pad,pad))
data = np.pad(data, pad_width=tpad, mode='constant', constant_values=0)
k_h = kernel.shape[2]
k_w = kernel.shape[3]
len_h = (data.shape[1] - k_h)/stride + 1
len_w = (data.shape[2] - k_w)/stride + 1
len_d = kernel.shape[0]
output = np.zeros((len_d,len_h,len_w))
kernel_rot = kernel[:,::-1,:,:]
# np.rot90(kernel[d,c,:,:],2)
for d in range(0,len_d,1): # no. of kernel
for c in range(0,3): # RGB
output[d,:,:] += sig.convolve2d(data[c,:,:],kernel_rot[d,c,:,:],mode='valid')
output[d,:,:] += bias_kernel[d]
return output
def convolve_3d_faster(data,kernel,bias_kernel,stride,pad):
#only works with stride = 1 and full padding the same size as input data though
len_h = (data.shape[1] - kernel.shape[2] +2*pad)/stride + 1
len_w = (data.shape[2] - kernel.shape[3] +2*pad)/stride + 1
len_d = kernel.shape[0]
output = np.zeros((len_d,len_h,len_w))
kernel_rot = kernel[:,::-1,:,:]
for d in range(0,len_d,1): # no. of kernel
for c in range(0,3): # RGB
output[d,:,:] += ndi.convolve(data[c,:,:],kernel_rot[d,c,:,:],mode='constant',cval=0.0)
output[d,:,:] += bias_kernel[d]
return output
def test_loop(data,kernel,bias_kernel,stride,pad):
#only works with stride = 1 and full padding the same size as input data though
len_h = (data.shape[2] - kernel.shape[2] +2*pad)/stride + 1
len_w = (data.shape[3] - kernel.shape[3] +2*pad)/stride + 1
len_d = kernel.shape[0]
data_size = data.shape[0]
output = np.zeros((data_size,len_d,len_h,len_w))
kernel_rot = kernel[:,::-1,:,:]
for i in range(0,data_size): # no. of sample
for d in range(0,len_d,1): # no. of kernel
for c in range(0,3): # RGB
output[i,d,:,:] += ndi.convolve(data[i,c,:,:],kernel_rot[d,c,:,:],mode='constant',cval=0.0)
output[i,d,:,:] += bias_kernel[d]
return output
def test_fft(data,kernel,bias_kernel,stride,pad):
#only works with stride = 1 and full padding the same size as input data though
if pad > 0:
tpad = ((0,0),(pad,pad),(pad,pad))
data = np.pad(data, pad_width=tpad, mode='constant', constant_values=0)
k_h = kernel.shape[2]
k_w = kernel.shape[3]
len_h = (data.shape[1] - k_h)/stride + 1
len_w = (data.shape[2] - k_w)/stride + 1
len_d = kernel.shape[0]
output = np.zeros((len_d,len_h,len_w))
for d in range(0,len_d,1): # no. of kernel
output[d,:,:] = sig.fftconvolve(data[:,:,:],kernel[d,:,:,:],mode='valid')[0]
output[d,:,:] += bias_kernel[d]
return output
In [31]:
#test convolution performance
k1 = np.arange(0,12*3*(3**2)).reshape(12,3,3,3)
d1 = np.arange(0,1000*3*(32**2)).reshape(1000,3,32,32)
b1 = np.random.uniform(-1,1,size=12)
print 'single-sample test'
out1 = convolve_3d_slow(d1[0],k1,b1,1,1)
out2 = convolve_3d_fast(d1[0],k1,b1,1,1)
out3 = convolve_3d_faster(d1[0],k1,b1,1,1)
out4 = test_fft(d1[0],k1,b1,1,1)
print np.allclose(out1,out2),
print np.allclose(out1,out3),
print np.allclose(out1,out4)
%timeit convolve_3d_slow(d1[0],k1,b1,1,1)
%timeit convolve_3d_fast(d1[0],k1,b1,1,1)
%timeit convolve_3d_faster(d1[0],k1,b1,1,1)
%timeit test_fft(d1[0],k1,b1,1,1)
#print 'loop test'
#out_loop1 = test_loop(d1,k1,b1,1,1)
#print np.allclose(out1,out_loop1[0])
#%timeit test_loop(d1,k1,b1,1,1)
In [32]:
def max_pool(x):
"""Return maximum in groups of 2x2 for a d,h,w image"""
#source: http://stackoverflow.com/a/31975845
d,h,w = x.shape
x = x.reshape(d,h/2,2,w/2,2).swapaxes(2,3).reshape(d,h/2,w/2,4)
return np.amax(x,axis=3)
def softmax(z):
c = -np.max(z)
tmp = np.exp(z + c)
total = np.sum(tmp)
return tmp/total
In [33]:
def backprop_f_RELU(grad_f_POOL,f_POOL_1batch,f_RELU_1batch):
"""upsample f_POOL then backprop to f_RELU """
tmp1 = np.repeat(grad_f_POOL,2,axis=1) #expand grad down
tmp1 = np.repeat(tmp1,2,axis=2) #expand grad right
tmp2 = np.repeat(f_POOL_1batch,2,axis=1) #expand f_POOL down
tmp2 = np.repeat(tmp2,2,axis=2) #expand f_POOL right
mask = f_RELU_1batch == tmp2
return mask*tmp1
def backprop_kernel(grad_f_CONV,x_arr):
""" sum grad up to (12,3,3,3) per kernel dimension """
c_img,h_img,w_img = x_arr.shape
# (12, 32, 32) and (3,32,32) -> (12,3,32,32)
tmp = np.einsum('ijk,ljk->iljk',grad_f_CONV,x_arr)
grad_kernel = np.zeros(kernel_size)
for k in range(0,kernel_size[0]): # no. of filter = 12
for c in range(0,kernel_size[1]): # RGB = 3
for h in range(0,kernel_size[2]): # height = 3
for w in range(0,kernel_size[3]): # width = 3
#sum relevant portion (with padding = 1)
h_from = max(0,0-1+h)
h_to = min(h_img,h_img-kernel_size[2]+2+h)
w_from = max(0,0-1+w)
w_to = min(w_img,w_img-kernel_size[3]+2+w)
grad_kernel[k,c,h,w] = np.sum(tmp[k,c,h_from:h_to,w_from:w_to])
return grad_kernel
In [34]:
import math
def loss_func(z,y):
z = z - np.max(z)
nom = np.sum(y*z)
denom = math.log(np.sum(np.exp(z)))
return -nom + denom
In [35]:
def forward_pass(X,kernel,bias_kernel,w_SOFT,bias_w_SOFT,Y,input_index,cache_flag):
f_CONV,f_RELU,f_POOL,f_OUT = [],[],[],[]
calc_loss = 0.0
input_size = len(input_index)
for k in input_index:
f_conv = convolve_3d_faster(X[k,:,:,:],kernel,bias_kernel,1,1)
f_relu = np.maximum(f_conv,0)
f_pool = max_pool(f_relu)
z = np.einsum('ijkl,jkl->i',w_SOFT,f_pool) + bias_w_SOFT
calc_loss += loss_func(z,Y[k])/len(input_index)
if cache_flag == 'Y':
f_CONV.append(f_conv)
f_RELU.append(f_relu)
f_POOL.append(f_pool)
f_OUT.append(softmax(z))
return f_CONV,f_RELU,f_POOL,f_OUT,calc_loss
In [37]:
#All Settings before running the program
from operator import mul
#build kernel = [Filter, RGB, Height, Width]
kernel = np.random.standard_normal(kernel_size)*math.sqrt(2.0/reduce(mul,kernel_size))
w_SOFT = np.random.standard_normal(weight_size)*math.sqrt(2.0/reduce(mul,weight_size))
bias_kernel = np.random.uniform(-0.01,0.01,size=12)
bias_w_SOFT = np.random.uniform(-0.01,0.01,size=10)
loss,loss_test = [],[]
#mini-batch setting, we need to cache some calculation for reuse later
mini_batch_size = min(100,int(0.2*train_size))
#learning rate for gradient descent
w_SOFT_learn_rate = 0.02
kernel_learn_rate = 0.04
momentum_rate = 0.9
epoch_size = 30
In [38]:
#Algo starts here
#shuffle_box = np.random.choice(train_size,size=train_size,replace=False)
prev_w_SOFT_update,prev_kernel_update = 0.0 , 0.0
start_time = time.time()
#forward pass
#INPUT --> CONV --> RELU --> POOL --> OUT (Softmax)
f_CONV,f_RELU,f_POOL,f_OUT,curr_loss = \
forward_pass(X_train,kernel,bias_kernel,w_SOFT,bias_w_SOFT,Y_train,range(0,train_size),'N')
_,_,_,f_OUT_test,curr_loss_test = \
forward_pass(X_test,kernel,bias_kernel,w_SOFT,bias_w_SOFT,Y_test,range(0,test_size),'N')
loss.append(curr_loss)
loss_test.append(curr_loss_test)
for e in range(0,epoch_size):
#print 'epoch ' + str(e) + ','
for m in range(0,train_size,mini_batch_size):
cache_index = range(m,min(train_size,m+mini_batch_size))
#print str(m) + ',',
#test_index = range(0,train_size)
f_CONV,f_RELU,f_POOL,f_OUT,curr_loss = \
forward_pass(X_train,kernel,bias_kernel,\
w_SOFT,bias_w_SOFT,Y_train,cache_index,'Y')
#backpropagation
#kernel <-- CONV <-- RELU <-- POOL <-- OUT (Softmax)
grad_w_SOFT = np.zeros((10,12,16,16))
grad_kernel = np.zeros((12,3,3,3))
grad_bias_kernel = np.zeros(12)
grad_bias_w_SOFT = np.zeros(10)
#for batch in range(m,min(train_size,m+mini_batch_size)):
for batch in range(0,len(cache_index)):
grad_w_SOFT += np.einsum('l,ijk->lijk',(f_OUT[batch] - Y_train[m+batch]),f_POOL[batch])
grad_f_POOL = np.einsum('l,lijk->ijk',(f_OUT[batch] - Y_train[m+batch]),w_SOFT)
grad_f_RELU = backprop_f_RELU(grad_f_POOL,f_POOL[batch],f_RELU[batch])
grad_f_CONV = (f_CONV[batch] >0)*grad_f_RELU
grad_kernel += backprop_kernel(grad_f_CONV,X_train[m+batch,:,:,:])
grad_bias_kernel += np.sum(grad_f_CONV,axis=(1,2))
grad_bias_w_SOFT += (f_OUT[batch] - Y_train[m+batch])
#update kernel and weight
learn_decay = (epoch_size - e*0.5)/(epoch_size)
w_SOFT_update = - w_SOFT_learn_rate*learn_decay*grad_w_SOFT/mini_batch_size
kernel_update = - kernel_learn_rate*learn_decay*grad_kernel/mini_batch_size
bias_w_SOFT_update = - kernel_learn_rate*learn_decay*grad_bias_w_SOFT/mini_batch_size
bias_kernel_update = - kernel_learn_rate*learn_decay*grad_bias_kernel/mini_batch_size
w_SOFT = w_SOFT + w_SOFT_update + momentum_rate*prev_w_SOFT_update
kernel = kernel + kernel_update + momentum_rate*prev_kernel_update
bias_w_SOFT = bias_w_SOFT + bias_w_SOFT_update
bias_kernel = bias_kernel + bias_kernel_update
prev_w_SOFT_update, prev_kernel_update = w_SOFT_update, kernel_update
if e % (epoch_size/5.0) == 0: #report progress 20% each
#a1 = np.sum(np.abs(w_SOFT_update))
#a2 = np.sum(np.abs(kernel_update))
#print 'w_SOFT_update = {:.4f}, kernel_update = {:.4f}'.format(a1,a2)
print '{:.2f}%...'.format(100.0*e/epoch_size),
_,_,_,_,curr_loss = \
forward_pass(X_train,kernel,bias_kernel,w_SOFT,bias_w_SOFT,\
Y_train,range(0,train_size),'N')
_,_,_,_,curr_loss_test = \
forward_pass(X_test,kernel,bias_kernel,w_SOFT,bias_w_SOFT,\
Y_test,range(0,test_size),'N')
loss.append(curr_loss)
loss_test.append(curr_loss_test)
print 'run time = {:.2f}s'.format((time.time()-start_time))
df_run = pd.DataFrame(data=np.array([loss,loss_test]).T,columns=['train','test'])
df_run.plot()
axes = plt.gca()
In [39]:
from sklearn.metrics import confusion_matrix
def show_confusion_matrix(cm_mat):
accuracy = np.trace(cm_mat)*100.0/test_size
print 'Test set Accuracy = {:.2f}%'.format(accuracy)
df_cm = pd.DataFrame(cm_mat,index=label_textno[0:cm_mat.shape[0]],columns=label_textno[0:cm_mat.shape[1]])
plt.figure(figsize = (8,6),dpi=600)
sns.heatmap(df_cm, cbar=True ,annot=True, fmt=',.0f')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Truth')
#get lastest prediction
_,_,_,f_OUT_test,curr_loss_test = \
forward_pass(X_test,kernel,bias_kernel,w_SOFT,bias_w_SOFT,\
Y_test,range(0,test_size),'Y')
#get the prediction to compare with target
label_text = ["plane","car","bird","cat","deer","dog","frog","horse","ship","truck"]
label_textno = list(label_text) #copy, so label_text itself is not overwritten below
for l in range(0,len(label_text)):
label_textno[l] = str(l) + ' ' + label_text[l]
y_pred = [np.argmax(r) for r in f_OUT_test]
y_truth = np.array(cifar['labels'])[test_ix]
cm_mat = confusion_matrix(y_truth,y_pred)
show_confusion_matrix(cm_mat)
The CNN has no problem separating big groupings like animals vs. machines. It has trouble telling apart the sub-categories -- (plane, truck, ship, car) among machines and (deer, cat, frog, bird, horse) among animals.
Since we only set up a single convolution layer, this may be the limit of the architecture. Further training will generally improve training-set performance but worsen test-set performance (overfitting).
In [330]:
#test out completely random selections.
# as expected, the accuracy is always around 1/(no. of categories)
print y_truth.shape
rt = np.random.randint(0,10,size=(y_truth.shape[0]))
cm_mat = confusion_matrix(y_truth,rt)
show_confusion_matrix(cm_mat)
In [ ]: