Dataset - CIFAR-10

Let's first prepare the data from the CIFAR-10 dataset.

Source: http://www.cs.utoronto.ca/~kriz/cifar.html

The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.

(we will be using batch_1 only out of 6 batches -- so 10,000 images in total)

The classes are ["airplane","automobile","bird","cat","deer","dog","frog","horse","ship","truck"]

  • Classes: 10
  • Samples per class: ~1,000
  • Samples total: 10,000
  • Dimensionality: 32x32x3 = 3,072
  • Features: integers 0-255 per pixel

In [1]:
import pandas as pd
import numpy as np
from random import randint
from pandas import Series,DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import math
import time
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

In [2]:
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        d = cPickle.load(fo)
    return d

In [4]:
# source: http://www.cs.utoronto.ca/~kriz/cifar.html
# get the training data, batch 1 out of 6. do unzip the tar.gz first
from time import strftime
import os

if os.name == 'nt': # Windows
    path = 'cifar\\data_batch_1'
else:
    path = 'cifar/data_batch_1'
    
cifar = unpickle(path)

In [5]:
def array_to_image(arr):
    im = arr.reshape((32,32,3),order='F')/255.0
    im = np.transpose(im,axes=(1,0,2))
    return im

In [6]:
# image data is uint8 array format 
# 3 RGB channels and 32x32 pixels means 32x32x3 = 1024x3 = 3072 numbers in one array
# 10,000 arrays in dataset
lucky_num = randint(0,9999) # randint is inclusive on both ends; valid indices are 0-9999
sample_arr = cifar["data"][lucky_num].copy()
print 'array_size = {}'.format(len(sample_arr))
print sample_arr
print 'file_name = {}'.format(cifar["filenames"][lucky_num])
#show image
plt.rcParams['figure.figsize'] = (4, 4)
plt.imshow(array_to_image(sample_arr))
plt.show()


array_size = 3072
[5 5 5 ..., 0 2 2]
file_name = jumbo_jet_s_001329.png

Convolutional Neural Networks (CNN)

Source1: http://cs231n.github.io/convolutional-networks/

Source2: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html

Note: Most of the knowledge below was gleaned from the Stanford CS231n course (Source1); its figures and explanations are excellent learning materials.

Convolutional Neural Networks (CNNs) are essentially normal neural nets with a "convolution" part that transforms the raw input into meaningful features. They are most commonly used for images, video, or other "dense" input problems.

What is convolution anyways?

The math definition is: "the integral of the product of two functions after one is reversed and shifted."

$$ (f * g)(t) = \int_{-\infty}^{\infty} f(x) \cdot g(t - x) \; dx $$

This image from wiki is much easier to understand.

In matrix form, below is how to convolve an input matrix with another matrix (called a kernel).

The convolution is essentially an element-wise multiplication followed by a sum of the results.

Note: the kernel needs to be flipped (in every dimension) before doing the multiplication.

Here we have a 5x5 (green) input matrix and a 3x3 (orange) kernel, using a stride of 1 (moving the kernel 1 cell at a time).
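
To make this concrete, here is a minimal numpy sketch (the numbers are made up for illustration) that does the flip-multiply-sum by hand for one output position and compares it against scipy.signal.convolve2d, the same routine used later in convolve_3d_fast:

import numpy as np
import scipy.signal as sig

inp = np.arange(25).reshape(5, 5)          # a 5x5 "green" input matrix
ker = np.array([[ 1, 2, 3],
                [ 0, 1, 0],
                [-1, 0, 1]])               # a 3x3 "orange" kernel (deliberately non-symmetric)

# convolution at output position (0,0): flip the kernel in both dimensions,
# then element-wise multiply with the top-left 3x3 patch and sum
manual_00 = np.sum(inp[0:3, 0:3] * ker[::-1, ::-1])

auto = sig.convolve2d(inp, ker, mode='valid')  # scipy does the flip internally
print manual_00 == auto[0, 0]   # True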

Even more useful is to look at how it is applied in real applications.

The most (visually) obvious use is as a filter in a photo editor -- edge detection, blur, etc. The kernel acts as a mask over the neighboring pixels that enhances or suppresses certain features.

Now that we know (roughly) what convolution is, let's move on to the CNN architecture.

CNN architecture

We're going to use the structure from here (Source1). The source website contains great explanations of many concepts, so I will only give a short summary for our purposes.

Note: For the sake of easier debugging of numpy array print format, I will use this notation [Depth x Height x Width]

  • INPUT - [3x32x32] - will hold the raw pixel values of the image with 3 color channels.
  • CONV - [12x32x32] - this layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the small region they are connected to in the input volume. This results in a [12x32x32] volume if we decide to use 12 filters (with each filter being [3x3x3]).
  • RELU - [12x32x32] - layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged.
  • POOL - [12x16x16] - layer will perform a downsampling operation along the spatial dimensions (width, height).
  • FC - [10x1x1] - (i.e. fully-connected) layer will compute the class scores, where each of the 10 numbers correspond to a class score, such as among the 10 categories of CIFAR-10.

Let's get into details of each layer.
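
Before diving into each layer, here is a quick shape sanity check -- a minimal numpy sketch with placeholder (all-zero) arrays, not the actual variables used later in the notebook:

import numpy as np

x      = np.zeros((3, 32, 32))    # INPUT  [3x32x32]
f_conv = np.zeros((12, 32, 32))   # CONV   [12x32x32]: 12 filters of [3x3x3], stride 1, pad 1
f_relu = np.maximum(f_conv, 0)    # RELU   [12x32x32]: shape unchanged
f_pool = f_relu.reshape(12, 16, 2, 16, 2).max(axis=(2, 4))   # POOL [12x16x16]: 2x2 max
scores = np.zeros(10)             # FC     [10x1x1]: one score per CIFAR-10 class
print x.shape, f_conv.shape, f_relu.shape, f_pool.shape, scores.shape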

INPUT layer

Not much to be said here. We will use an image matrix of [RGB,Height,Width] = [3,32,32].

CONV layer

First, we will use a filter (a.k.a. kernel) matrix of shape [RGB x 3 x 3] = [3x3x3].

In terms of neurons, the number of weights (parameters) for each filter is 3x3x3 + 1 bias = 28.

Since we will have 12 filter slices (also called the "depth"), we will have 28x12 = 336 parameters in total for the CONV layer.

The formula below calculates the shape of the output matrix.

For input Width (or height) = 32, Filter width = 3, zero-Padding = 1, and Stride = 1 :

$$ \begin{align} {Output}& = \dfrac{(W + 2 \cdot P - F)}{S} + 1 \\ {Output}& = \dfrac{(32 + 2 \cdot 1 - 3)}{1} + 1 = 32 \end{align} $$

So our output matrix from this layer is [12x32x32].
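
To make the formula concrete, here is a tiny helper (a sketch only, not used elsewhere in the notebook) that evaluates it for one spatial dimension:

def conv_output_size(W, F, P, S):
    """One spatial dimension: (W + 2*P - F) // S + 1."""
    return (W + 2 * P - F) // S + 1

print conv_output_size(W=32, F=3, P=1, S=1)   # 32: with zero-padding 1 the size is preserved
print conv_output_size(W=32, F=3, P=0, S=1)   # 30: without padding the volume shrinks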

RELU layer

We will apply a Rectified Linear Unit (ReLU) function here. The output has the same dimensions, [12x32x32].

$$ f(x) = max(0,x) $$

Or, put another way:

$$ f(x) = \begin{cases} x & \text{if } x \gt 0 \\ 0 & \text{Otherwise} \end{cases} $$
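
In numpy this is a single call -- the same np.maximum(x, 0) that the forward pass uses later:

import numpy as np
z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print np.maximum(z, 0)   # [ 0.   0.   0.   1.5  3. ]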

POOL layer

Downsampling from [12x32x32] to [12x16x16].

We are going to take the max over each (non-overlapping) 2x2 block within each 32x32 plane, so the spatial dimensions are halved -- giving [12x16x16].
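
For example, a single [4x4] slice pools down to [2x2] by taking the max over each 2x2 block. A small numpy sketch (illustrative numbers, using the same reshape trick as the max_pool helper defined later):

import numpy as np
a = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]])
pooled = a.reshape(2, 2, 2, 2).swapaxes(1, 2).reshape(2, 2, 4).max(axis=2)
print pooled
# [[ 4  8]
#  [12 16]]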

FC layer

This is a normal NN output layer, connecting the whole [12x16x16] input to the [10x1x1] predicted output.

We are going to use the Softmax function so that the outputs form a valid probability distribution.

$$ z_i = w_{soft} \cdot f_{POOL,i} $$

$$ \sigma_{soft}(z_i) = \cfrac{e^{z_{i}}}{ \sum_{j=0}^m e^{z_{j}} } $$

However, there is a chance that $ e^{z} $ becomes very large, which could overflow and make the computation blow up when dividing two huge numbers.

Therefore, we can add a constant $C$ inside the exponent of both the numerator and the denominator, so the magnitudes stay reasonable.

$$ \sigma_{soft}(z_i) = \cfrac{e^{z_i + C}}{ \sum_{j=0}^m e^{z_j + C} } $$

$$ C = -\max z $$
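
A quick numerical illustration of why the shift matters (a sketch; the softmax helper defined later applies the same $C = -\max z$ shift):

import numpy as np
z = np.array([1000.0, 1001.0, 1002.0])

naive = np.exp(z) / np.sum(np.exp(z))     # np.exp(1000) overflows to inf, so this is all nan
shifted = np.exp(z - np.max(z))           # C = -max(z), exponents become [-2, -1, 0]
stable = shifted / np.sum(shifted)
print naive      # [ nan  nan  nan]
print stable     # [ 0.09003057  0.24472847  0.66524096]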

Loss function

Based on the Softmax classifier, the (cross-entropy) loss function looks like this:

$$ L_i = - y_i \cdot \log( \; \sigma_{soft, i} \;) $$

$$ L_i = - y_i \cdot \log( \; \frac {e^{z_{i}}}{\sum e^z} \;) $$

$$ L_i = - ( \; z_{y,i} - \log \sum e^z \;) $$

$$ L = \frac{1}{N} \sum L_i $$

This roughly means that for each instance we only use the predicted output ($\sigma_{soft}$) of the correct class for that particular instance; the rest are set to zero.

For the total loss over the whole dataset, we average the per-instance losses $L_i$ over all $N$ instances.
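
As a small check that the last two forms of $L_i$ agree (a sketch with made-up scores; the loss_func defined further down uses the same $-(z_y - \log \sum e^z)$ form):

import numpy as np
z = np.array([2.0, 1.0, 0.1])                       # raw class scores
y = np.array([1.0, 0.0, 0.0])                       # one-hot target, correct class = 0

p = np.exp(z - np.max(z))
p = p / np.sum(p)                                   # softmax probabilities
loss_direct = -np.sum(y * np.log(p))                # -y . log(sigma_soft)
loss_logsum = -(z[0] - np.log(np.sum(np.exp(z))))   # -(z_y - log sum e^z)
print np.allclose(loss_direct, loss_logsum)         # True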

Forward Pass

This describes the flow of data.

INPUT --> CONV --> RELU --> POOL --> OUT (Softmax)

Back Propagation

The equations and derivations are from (this slide). Doing backprop in stages using the chain rule is very useful.

For the OUT layer, the Softmax backprop is as follows.

$$ \begin{align} \frac{\partial L}{\partial \sigma_{soft}} &= - \frac{y}{\sigma_{soft} } \\[5pt] \frac{\partial L}{\partial z_i} &= - \frac{y_i}{\sigma_{soft,i} } \cdot \frac{\partial \sigma_{soft, i}}{\partial z_i} + \sum_{j \neq i} - \frac{y_j}{\sigma_{soft,j}} \cdot \frac{\partial \sigma_{soft, j}}{\partial z_i} \\[5pt] \frac{\partial \sigma_{soft, i}}{\partial z_i} &= \sigma_{soft,i} \cdot (1 - \sigma_{soft, i}) \quad \text{, where } \sigma_{soft} \text{ is for } i \\[5pt] \frac{\partial \sigma_{soft, j}}{\partial z_i} &= - \sigma_{soft,j} \cdot \sigma_{soft, i} \qquad \text{, where } \sigma_{soft} \text{ is NOT for } i \\[5pt] \frac{\partial L}{\partial z_i} &= \sigma_{soft, i} - y_i \\[5pt] \frac{\partial L}{\partial w_{soft}} &= (\sigma_{soft, i} - y_i) \cdot f_{POOL,i} \quad \text{, for the weight update}\\[5pt] \frac{\partial L}{\partial f_{POOL,i}} &= (\sigma_{soft, i} - y_i) \cdot w_{soft} \qquad \text{, for the pass-through to the lower layer} \end{align} $$
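
The key result here, $\frac{\partial L}{\partial z_i} = \sigma_{soft,i} - y_i$, is easy to verify with a finite-difference check (a standalone sketch, separate from the training code below):

import numpy as np

def softmax_demo(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss_demo(z, y):
    return -np.sum(y * np.log(softmax_demo(z)))

z = np.random.randn(10)
y = np.zeros(10); y[3] = 1.0                        # one-hot target

analytic = softmax_demo(z) - y                      # sigma_soft - y
eps = 1e-6
numeric = np.array([(loss_demo(z + eps*np.eye(10)[i], y) -
                     loss_demo(z - eps*np.eye(10)[i], y)) / (2*eps) for i in range(10)])
print np.allclose(analytic, numeric, atol=1e-5)     # True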


For the POOL layer, we can pass the gradient directly back to whichever input was the "max"; all other inputs get zero.

$$ \frac{\partial L}{\partial f_{RELU,i}} = \begin{cases} \frac{\partial L}{\partial f_{POOL,i}} & \text{if } f_{RELU,i} \text{ is } \max \\ 0 & \text{Otherwise} \end{cases} $$


For the RELU layer, again we can pass through the gradients directly if $ \; f_{CONV} \gt 0 $.

$$ \frac{\partial L}{\partial f_{CONV,i}} = \begin{cases} \frac{\partial L}{\partial f_{RELU,i}} & \text{if } f_{CONV,i} \gt 0 \\ 0 & \text{Otherwise} \end{cases} $$


For the CONV layer, there are a few steps.

First, let's look at an example of convolving a [2x2] kernel $w_{r,c}$ with a [3x3] input $x_{r,c}$. The result is a [2x2] output matrix $f_{i,j}$.

Note: $r$ = row index starting at 0 and $c$ = column index starting at 0, like so:

$w_{0,0}$ $w_{0,1}$
$w_{1,0}$ $w_{1,1}$

So for the forward pass with stride = 1, we have all the $f_{i,j}$ like so:

$$ \begin{align} f_{0,0} &= w_{0,0} \cdot x_{0,0} + w_{0,1} \cdot x_{0,1} + w_{1,0} \cdot x_{1,0} + w_{1,1} \cdot x_{1,1} + \text{bias}\\ f_{0,1} &= w_{0,0} \cdot x_{0,1} + w_{0,1} \cdot x_{0,2} + w_{1,0} \cdot x_{1,1} + w_{1,1} \cdot x_{1,2} + \text{bias}\\ f_{1,0} &= w_{0,0} \cdot x_{1,0} + w_{0,1} \cdot x_{1,1} + w_{1,0} \cdot x_{2,0} + w_{1,1} \cdot x_{2,1} + \text{bias}\\ f_{1,1} &= w_{0,0} \cdot x_{1,1} + w_{0,1} \cdot x_{1,2} + w_{1,0} \cdot x_{2,1} + w_{1,1} \cdot x_{2,2} + \text{bias} \end{align} $$

Then we can calculate the gradients -- the partial derivatives with respect to the kernel weights $w$.

$f$ here means $f_{CONV}$, the result from CONV layer:

$$ \begin{align} \frac{\partial f_{0,0}}{\partial w_{0,0}} &= x_{0,0} \\[5pt] \frac{\partial f_{i,j}}{\partial w_{r,c}} &= x_{(i+r),(j+c)} \\[5pt] \frac{\partial L}{\partial w_{r,c}} &= \sum_{i=0,j=0}^{n,m} \frac{\partial L}{\partial f_{i,j}} \cdot \frac{\partial f_{i,j}}{\partial w_{r,c}} \\[5pt] \frac{\partial L}{\partial w_{r,c}} &= \sum_{i=0,j=0}^{n,m} \frac{\partial L}{\partial f_{i,j}} \cdot x_{(i+r),(j+c)} \end{align} $$

For "bias", the partial derivative is 1, so we can just update it with the sum of gradients:

$$ \frac{\partial L}{\partial \text{bias}} = \sum_{i=0,j=0}^{n,m} \frac{\partial L}{\partial f_{i,j}} \cdot 1 $$
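
We can sanity-check these formulas on the small [3x3] input / [2x2] kernel example with a finite-difference test. This is a standalone sketch with made-up numbers; it uses the un-flipped (cross-correlation) form, exactly matching the $f_{i,j}$ equations above:

import numpy as np

x = np.random.randn(3, 3)                  # [3x3] input
w = np.random.randn(2, 2)                  # [2x2] kernel
b = 0.5
g = np.random.randn(2, 2)                  # pretend upstream gradient dL/df

def forward_demo(x, w, b):
    f = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            f[i, j] = np.sum(w * x[i:i+2, j:j+2]) + b   # matches the f_{i,j} equations
    return f

# analytic gradients: dL/dw_{r,c} = sum_{i,j} dL/df_{i,j} * x_{(i+r),(j+c)}
dw = np.zeros((2, 2))
for r in range(2):
    for c in range(2):
        dw[r, c] = np.sum(g * x[r:r+2, c:c+2])
db = np.sum(g)                              # bias gradient is just the sum of dL/df

# finite-difference check of dw against the scalar loss L = sum(g * f)
eps = 1e-6
dw_num = np.zeros((2, 2))
for r in range(2):
    for c in range(2):
        wp = w.copy(); wp[r, c] += eps
        wm = w.copy(); wm[r, c] -= eps
        dw_num[r, c] = np.sum(g * (forward_demo(x, wp, b) - forward_demo(x, wm, b))) / (2*eps)
print np.allclose(dw, dw_num)               # True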


Optimization Method

We are going to use mini-batch stochastic gradient descent (SGD).

The gradient updates will use momentum, and the learning rate will decay with each epoch, as sketched below.
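
As a rough sketch of what one update looks like (generic variable names, minimizing a toy quadratic rather than the CNN loss; the real loop below applies the same pattern to the kernel and softmax weights):

import numpy as np

# toy problem: minimize 0.5*||w||^2, whose gradient is simply w
w = np.array([5.0, -3.0])
learn_rate, momentum_rate, epoch_size = 0.1, 0.9, 30
prev_update = 0.0

for e in range(epoch_size):
    decay = (epoch_size - e * 0.5) / epoch_size     # linear decay, same shape as the training loop
    grad = w                                        # would come from backprop in the real code
    update = -learn_rate * decay * grad
    w = w + update + momentum_rate * prev_update
    prev_update = update

print np.linalg.norm(w)   # much smaller than the starting norm of ~5.83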

Computational Performance

Since this code is intended to be clear for educational purposes, its performance is far from production level.

I'm trying to use "just" numpy arrays and only bring in other helper libraries as necessary.


This ends the theory part. Let's start coding!

...

..

.

Coding Starts Here


In [7]:
# X Format is [image_index, RGB, Height, Width]
X = cifar["data"].copy()
X = X.reshape((-1,3,32,32),order='C')/255.0 - 0.5 #standardize

#set size of input, features, hidden, target
instance_size = X.shape[0]
feature_size = X.shape[1]*X.shape[2]*X.shape[3]
target_size = 10
kernel_size = (12,3,3,3)
weight_size = (10,12,16,16) #softmax layer

#make a flat 10 output with all zeros
Y = np.zeros((instance_size,10))
for j in range(0,instance_size):
    Y[j][cifar['labels'][j]] = 1

#split train and test dataset
train_split = 0.8 #6000.0/instance_size #0.8
train_size = int(train_split*instance_size)
test_size = instance_size - train_size
index = np.random.permutation(instance_size)
train_ix, test_ix = index[:train_size], index[train_size:train_size+test_size]
Y_train , Y_test = Y[train_ix,:], Y[test_ix,:]
X_train , X_test = X[train_ix,:,:,:], X[test_ix,:,:,:]

#Xb = np.insert(X_train,0,1,axis=1) #add bias input, always activated
#Xb_test = np.insert(X_test,0,1,axis=1) #add bias input, always activated

In [30]:
def convolve_3d_slow(data,kernel,bias_kernel,stride,pad): 
    #73.8 ms per call --> about 10 min per 8,000 images
    if pad > 0:
        tpad = ((0,0),(pad,pad),(pad,pad))
        data = np.pad(data, pad_width=tpad, mode='constant', constant_values=0)
    k_h = kernel.shape[2]
    k_w = kernel.shape[3]
    len_h = (data.shape[1] - k_h)/stride + 1
    len_w = (data.shape[2] - k_w)/stride + 1
    len_d = kernel.shape[0]
    output = np.zeros((len_d,len_h,len_w))
    kernel_rot = kernel[:,::-1,::-1,::-1]
    for h in range(0,len_h*stride,stride):
        for w in range(0,len_w*stride,stride):      
            for d in range(0,len_d,1):
                output[d,h/stride,w/stride] = np.sum(data[:,h:(h+k_h),w:(w+k_w)]*kernel_rot[d,:,:,:]) + bias_kernel[d]
    return output

import scipy.ndimage as ndi
import scipy.signal as sig
def convolve_3d_fast(data,kernel,bias_kernel,stride,pad):
    #use scipy convolve then sum up
    if pad > 0:
        tpad = ((0,0),(pad,pad),(pad,pad))
        data = np.pad(data, pad_width=tpad, mode='constant', constant_values=0)
    k_h = kernel.shape[2]
    k_w = kernel.shape[3]
    len_h = (data.shape[1] - k_h)/stride + 1
    len_w = (data.shape[2] - k_w)/stride + 1
    len_d = kernel.shape[0]
    output = np.zeros((len_d,len_h,len_w))
    kernel_rot = kernel[:,::-1,:,:]
    # np.rot90(kernel[d,c,:,:],2)
    for d in range(0,len_d,1): # no. of kernel
        for c in range(0,3): # RGB
            output[d,:,:] += sig.convolve2d(data[c,:,:],kernel_rot[d,c,:,:],mode='valid')
        output[d,:,:] += bias_kernel[d]
    return output

def convolve_3d_faster(data,kernel,bias_kernel,stride,pad):
    #only works with stride = 1 and full padding the same size as input data though
    len_h = (data.shape[1] - kernel.shape[2] +2*pad)/stride + 1
    len_w = (data.shape[2] - kernel.shape[3] +2*pad)/stride + 1
    len_d = kernel.shape[0]

    output = np.zeros((len_d,len_h,len_w))
    kernel_rot = kernel[:,::-1,:,:]
    for d in range(0,len_d,1): # no. of kernel
        for c in range(0,3): # RGB 
            output[d,:,:] += ndi.convolve(data[c,:,:],kernel_rot[d,c,:,:],mode='constant',cval=0.0)
        output[d,:,:] += bias_kernel[d]
    return output

def test_loop(data,kernel,bias_kernel,stride,pad):
    #only works with stride = 1 and full padding the same size as input data though
    len_h = (data.shape[2] - kernel.shape[2] +2*pad)/stride + 1
    len_w = (data.shape[3] - kernel.shape[3] +2*pad)/stride + 1
    len_d = kernel.shape[0]
    
    data_size = data.shape[0]
    output = np.zeros((data_size,len_d,len_h,len_w))
    kernel_rot = kernel[:,::-1,:,:]
    
    for i in range(0,data_size): # no. of sample
        for d in range(0,len_d,1): # no. of kernel
            for c in range(0,3): # RGB 
                output[i,d,:,:] += ndi.convolve(data[i,c,:,:],kernel_rot[d,c,:,:],mode='constant',cval=0.0)
            output[i,d,:,:] += bias_kernel[d]    
    return output

def test_fft(data,kernel,bias_kernel,stride,pad):
    #only works with stride = 1 and full padding the same size as input data though
    if pad > 0:
        tpad = ((0,0),(pad,pad),(pad,pad))
        data = np.pad(data, pad_width=tpad, mode='constant', constant_values=0)
    k_h = kernel.shape[2]
    k_w = kernel.shape[3]
    len_h = (data.shape[1] - k_h)/stride + 1
    len_w = (data.shape[2] - k_w)/stride + 1
    len_d = kernel.shape[0]

    output = np.zeros((len_d,len_h,len_w))
            
    for d in range(0,len_d,1): # no. of kernel  
        output[d,:,:] = sig.fftconvolve(data[:,:,:],kernel[d,:,:,:],mode='valid')[0]
        output[d,:,:] += bias_kernel[d]
    return output

In [31]:
#test convolution performance 
k1 = np.arange(0,12*3*(3**2)).reshape(12,3,3,3)
d1 = np.arange(0,1000*3*(32**2)).reshape(1000,3,32,32)
b1 = np.random.uniform(-1,1,size=12)
print 'single-sample test'
out1 = convolve_3d_slow(d1[0],k1,b1,1,1)
out2 = convolve_3d_fast(d1[0],k1,b1,1,1)
out3 = convolve_3d_faster(d1[0],k1,b1,1,1)
out4 = test_fft(d1[0],k1,b1,1,1)
print np.allclose(out1,out2),
print np.allclose(out1,out3),
print np.allclose(out1,out4)
%timeit convolve_3d_slow(d1[0],k1,b1,1,1)
%timeit convolve_3d_fast(d1[0],k1,b1,1,1)
%timeit convolve_3d_faster(d1[0],k1,b1,1,1)
%timeit test_fft(d1[0],k1,b1,1,1)
#print 'loop test'
#out_loop1 = test_loop(d1,k1,b1,1,1)
#print np.allclose(out1,out_loop1[0])
#%timeit test_loop(d1,k1,b1,1,1)


single-sample test
True True True
10 loops, best of 3: 157 ms per loop
100 loops, best of 3: 2.31 ms per loop
1000 loops, best of 3: 1.9 ms per loop
100 loops, best of 3: 7.77 ms per loop

In [32]:
def max_pool(x):
    """Return maximum in groups of 2x2 for a d,h,w image"""
    #source: http://stackoverflow.com/a/31975845
    d,h,w = x.shape
    x = x.reshape(d,h/2,2,w/2,2).swapaxes(2,3).reshape(d,h/2,w/2,4)
    return np.amax(x,axis=3)

def softmax(z):
    c = -np.max(z)
    tmp = np.exp(z + c)
    total = np.sum(tmp)
    return tmp/total

In [33]:
def backprop_f_RELU(grad_f_POOL,f_POOL_1batch,f_RELU_1batch):
    """upsample f_POOL then backprop to f_RELU """
    tmp1 = np.repeat(grad_f_POOL,2,axis=1) #expand grad down
    tmp1 = np.repeat(tmp1,2,axis=2) #expand grad right
    tmp2 = np.repeat(f_POOL_1batch,2,axis=1) #expand f_POOL down
    tmp2 = np.repeat(tmp2,2,axis=2) #expand f_POOL right
    mask = f_RELU_1batch == tmp2
    return mask*tmp1

def backprop_kernel(grad_f_CONV,x_arr):
    """ sum grad up to (12,3,3,3) per kernel dimension """
    c_img,h_img,w_img = x_arr.shape
    # (12, 32, 32) and (3,32,32) -> (12,3,32,32)
    tmp = np.einsum('ijk,ljk->iljk',grad_f_CONV,x_arr)
    grad_kernel = np.zeros(kernel_size)
    for k in range(0,kernel_size[0]): # no. of filter = 12
        for c in range(0,kernel_size[1]): # RGB = 3
            for h in range(0,kernel_size[2]): # height = 3
                for w in range(0,kernel_size[3]): # width = 3
                    #sum relevant portion (with padding = 1)
                    h_from = max(0,0-1+h)
                    h_to = min(h_img,h_img-kernel_size[2]+2+h)
                    w_from = max(0,0-1+w)
                    w_to = min(w_img,w_img-kernel_size[3]+2+w)
                    grad_kernel[k,c,h,w] = np.sum(tmp[k,c,h_from:h_to,w_from:w_to]) 
    return grad_kernel

In [34]:
import math
def loss_func(z,y):
    z = z - np.max(z)
    nom = np.sum(y*z)
    denom = math.log(np.sum(np.exp(z)))
    return -nom + denom

In [35]:
def forward_pass(X,kernel,bias_kernel,w_SOFT,bias_w_SOFT,Y,input_index,cache_flag):
    f_CONV,f_RELU,f_POOL,f_OUT = [],[],[],[]
    calc_loss = 0.0
    input_size = len(input_index)
    for k in input_index:
        f_conv = convolve_3d_faster(X[k,:,:,:],kernel,bias_kernel,1,1)
        f_relu = np.maximum(f_conv,0)
        f_pool = max_pool(f_relu)
        z = np.einsum('ijkl,jkl->i',w_SOFT,f_pool) + bias_w_SOFT
        calc_loss += loss_func(z,Y[k])/len(input_index)
        
        if cache_flag == 'Y':
            f_CONV.append(f_conv)
            f_RELU.append(f_relu)
            f_POOL.append(f_pool)
            f_OUT.append(softmax(z))

    return f_CONV,f_RELU,f_POOL,f_OUT,calc_loss

In [37]:
#All Settings before running the program
from operator import mul

#build kernel = [Filter, RGB, Height, Width]
kernel = np.random.standard_normal(kernel_size)*math.sqrt(2.0/reduce(mul,kernel_size))
w_SOFT = np.random.standard_normal(weight_size)*math.sqrt(2.0/reduce(mul,weight_size))

bias_kernel = np.random.uniform(-0.01,0.01,size=12)
bias_w_SOFT = np.random.uniform(-0.01,0.01,size=10)

loss,loss_test = [],[]

#mini-batch setting, we need to cache some calculation for reuse later
mini_batch_size = min(100,int(0.2*train_size))

#learning rate for gradient descent
w_SOFT_learn_rate = 0.02
kernel_learn_rate = 0.04
momentum_rate = 0.9

epoch_size = 30

In [38]:
#Algo starts here
#shuffle_box = np.random.choice(train_size,size=train_size,replace=False)
prev_w_SOFT_update,prev_kernel_update = 0.0 , 0.0

start_time = time.time()

#forward pass
#INPUT --> CONV --> RELU --> POOL --> OUT (Softmax)
f_CONV,f_RELU,f_POOL,f_OUT,curr_loss = \
    forward_pass(X_train,kernel,bias_kernel,w_SOFT,bias_w_SOFT,Y_train,range(0,train_size),'N')
_,_,_,f_OUT_test,curr_loss_test = \
    forward_pass(X_test,kernel,bias_kernel,w_SOFT,bias_w_SOFT,Y_test,range(0,test_size),'N')
loss.append(curr_loss)
loss_test.append(curr_loss_test)

for e in range(0,epoch_size):
    
    #print 'epoch ' + str(e) + ','
    
    for m in range(0,train_size,mini_batch_size):
        
        cache_index = range(m,min(train_size,m+mini_batch_size))
        #print str(m) + ',',
        #test_index = range(0,train_size)
        
        f_CONV,f_RELU,f_POOL,f_OUT,curr_loss = \
                forward_pass(X_train,kernel,bias_kernel,\
                                     w_SOFT,bias_w_SOFT,Y_train,cache_index,'Y')
     
        #backpropagation
        #kernel <-- CONV <-- RELU <-- POOL <-- OUT (Softmax)
        grad_w_SOFT = np.zeros((10,12,16,16))
        grad_kernel = np.zeros((12,3,3,3))
        grad_bias_kernel = np.zeros(12)
        grad_bias_w_SOFT = np.zeros(10)

        #for batch in range(m,min(train_size,m+mini_batch_size)):
        for batch in range(0,len(cache_index)):
            grad_w_SOFT += np.einsum('l,ijk->lijk',(f_OUT[batch] - Y_train[m+batch]),f_POOL[batch])
            grad_f_POOL = np.einsum('l,lijk->ijk',(f_OUT[batch] - Y_train[m+batch]),w_SOFT)     
            grad_f_RELU = backprop_f_RELU(grad_f_POOL,f_POOL[batch],f_RELU[batch])
            grad_f_CONV = (f_CONV[batch] >0)*grad_f_RELU
            grad_kernel += backprop_kernel(grad_f_CONV,X_train[m+batch,:,:,:])
            grad_bias_kernel += np.sum(grad_f_CONV,axis=(1,2))
            grad_bias_w_SOFT += (f_OUT[batch] - Y_train[m+batch])

        #update kernel and weight
        learn_decay = (epoch_size - e*0.5)/(epoch_size)
        w_SOFT_update = - w_SOFT_learn_rate*learn_decay*grad_w_SOFT/mini_batch_size
        kernel_update = - kernel_learn_rate*learn_decay*grad_kernel/mini_batch_size
        bias_w_SOFT_update = - kernel_learn_rate*learn_decay*grad_bias_w_SOFT/mini_batch_size
        bias_kernel_update = - kernel_learn_rate*learn_decay*grad_bias_kernel/mini_batch_size

        w_SOFT = w_SOFT + w_SOFT_update + momentum_rate*prev_w_SOFT_update
        kernel = kernel + kernel_update + momentum_rate*prev_kernel_update
        bias_w_SOFT = bias_w_SOFT + bias_w_SOFT_update
        bias_kernel = bias_kernel + bias_kernel_update

        prev_w_SOFT_update, prev_kernel_update = w_SOFT_update, kernel_update
    
    if e % (epoch_size/5.0) == 0: #report progress 20% each
        #a1 = np.sum(np.abs(w_SOFT_update))
        #a2 = np.sum(np.abs(kernel_update))
        #print 'w_SOFT_update = {:.4f}, kernel_update = {:.4f}'.format(a1,a2)
        print '{:.2f}%...'.format(100.0*e/epoch_size),
    
    _,_,_,_,curr_loss = \
            forward_pass(X_train,kernel,bias_kernel,w_SOFT,bias_w_SOFT,\
                         Y_train,range(0,train_size),'N')
    _,_,_,_,curr_loss_test = \
            forward_pass(X_test,kernel,bias_kernel,w_SOFT,bias_w_SOFT,\
                         Y_test,range(0,test_size),'N')
    
    loss.append(curr_loss)
    loss_test.append(curr_loss_test)
    
print 'run time = {:.2f}s'.format((time.time()-start_time))    

df_run = pd.DataFrame(data=np.array([loss,loss_test]).T,columns=['train','test'])
df_run.plot()   
axes = plt.gca()


0.00%... 20.00%... 40.00%... 60.00%... 80.00%... run time = 1702.66s

In [39]:
from sklearn.metrics import confusion_matrix
def show_confusion_matrix(cm_mat):
    accuracy = np.trace(cm_mat)*100.0/test_size
    print 'Test set Accuracy = {:.2f}%'.format(accuracy)
    df_cm = pd.DataFrame(cm_mat,index=label_textno[0:cm_mat.shape[0]],columns=label_textno[0:cm_mat.shape[1]])
    plt.figure(figsize = (8,6),dpi=600)
    sns.heatmap(df_cm, cbar=True ,annot=True, fmt=',.0f')
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted') # confusion_matrix puts predictions in the columns
    plt.ylabel('Truth')     # and the true labels in the rows

#get lastest prediction
_,_,_,f_OUT_test,curr_loss_test = \
            forward_pass(X_test,kernel,bias_kernel,w_SOFT,bias_w_SOFT,\
                         Y_test,range(0,test_size),'Y')
#get the prediction to compare with target
label_text = ["plane","car","bird","cat","deer","dog","frog","horse","ship","truck"]
label_textno = list(label_text) #copy so the original labels are not overwritten
for l in range(0,len(label_text)):
    label_textno[l] = str(l) + ' ' + label_text[l]
    
y_pred = [np.argmax(r) for r in f_OUT_test]
y_truth = np.array(cifar['labels'])[test_ix]
cm_mat = confusion_matrix(y_truth,y_pred)
show_confusion_matrix(cm_mat)


Test set Accuracy = 41.20%

The CNN has no problem separating the big groupings, like animals vs. machines. It has problems telling apart the sub-categories -- (plane, truck, ship, car) among machines and (deer, cat, frog, bird, horse) among animals.

Since we only set up a single convolution layer, this may be the limit. Further training will generally improve the training set performance but worsen the test set performance.


In [330]:
#test out completely random selections.
# as expected, the accuracy is always around 1/(no. of categories)
print y_truth.shape
rt = np.random.randint(0,10,size=(y_truth.shape[0]))
cm_mat = confusion_matrix(y_truth,rt)
show_confusion_matrix(cm_mat)


(2000L,)
Test set Accuracy = 9.55%

In [ ]: