CNN from scratch

In this notebook, we're going to build a convolutional neural network for recognizing handwritten digits from scratch. By "from scratch", I mean without using TensorFlow's almighty neural network functions such as tf.nn.conv2d. This way, you'll be able to open up the black box and understand more clearly how a CNN works. We'll use TensorFlow interactively, so you can check the intermediate results along the way, which will also help your understanding.

Outline

Here are some functions we will implement from scratch in this notebook.

  1. Convolutional layer
  2. ReLU
  3. Max Pooling
  4. Affine layer (Fully connected layer)
  5. Softmax
  6. Cross entropy error

First things first, let's import TensorFlow


In [1]:
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))


TensorFlow Version: 1.4.1
Default GPU Device: /device:GPU:0

These two lines of code will download and read in the handwritten digits data automatically.


In [2]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/home/arasdar/datasets/MNIST_data/", one_hot=True, reshape=False)


Extracting /home/arasdar/datasets/MNIST_data/train-images-idx3-ubyte.gz
Extracting /home/arasdar/datasets/MNIST_data/train-labels-idx1-ubyte.gz
Extracting /home/arasdar/datasets/MNIST_data/t10k-images-idx3-ubyte.gz
Extracting /home/arasdar/datasets/MNIST_data/t10k-labels-idx1-ubyte.gz

We're going to look at only 100 examples at a time.


In [3]:
batch_size = 100

Let's look at the first example in a batch. Each image is represented as a 28×28×1 array of numbers.


In [4]:
example_X, example_ys = mnist.train.next_batch(batch_size)
example_X[0].shape


Out[4]:
(28, 28, 1)

We use the convenient InteractiveSession so we can check intermediate results along the way. With it, you can use Tensor.eval() and Operation.run() without having to specify a session explicitly.


In [5]:
session = tf.InteractiveSession()

We start building the computation graph by creating placeholders for the input images (X) and the target output labels (t).


In [6]:
X = tf.placeholder('float', [batch_size, 28, 28, 1])
t = tf.placeholder('float', [batch_size, 10])

Below is an overview of the model we will build. It starts with a convolutional layer, passes the result through a ReLU, a max-pooling layer, an affine layer, another ReLU, a second affine layer, and finally a softmax. Keep this architecture in mind while you're following the notebook.

$$ conv - relu - pool - affine - relu - affine - softmax$$
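For orientation, here is how the shape of one batch flows through these layers (batch size 100); the same shapes appear in the cell outputs later in the notebook:

X:                        (100, 28, 28, 1)
conv (5x5, 30 filters):   (100, 28, 28, 30)
relu:                     (100, 28, 28, 30)
max pool (2x2, stride 2): (100, 14, 14, 30)
affine 1 + relu:          (100, 100)
affine 2:                 (100, 10)
softmax:                  (100, 10)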

Convolutional layer


In [7]:
filter_h, filter_w, filter_c, filter_n = 5, 5, 1, 30

In [8]:
W1 = tf.Variable(tf.random_normal([filter_h, filter_w, filter_c, filter_n], stddev=0.01))
b1 = tf.Variable(tf.zeros([filter_n]))

In [9]:
def convolution(X, W, b, padding, stride):
    n, h, w, c = map(lambda d: d.value, X.get_shape())
    filter_h, filter_w, filter_c, filter_n = [d.value for d in W.get_shape()]

    # output spatial size: (input + 2*padding - filter) // stride + 1
    out_h = (h + 2*padding - filter_h)//stride + 1
    out_w = (w + 2*padding - filter_w)//stride + 1

    # unroll every receptive-field window into a row, and the filters into columns
    X_flat = flatten(X, filter_h, filter_w, filter_c, out_h, out_w, stride, padding)
    W_flat = tf.reshape(W, [filter_h*filter_w*filter_c, filter_n])
    
    z = tf.matmul(X_flat, W_flat) + b     # b: 1 x filter_n, broadcast over all rows
    
    # reshape back to [n, out_h, out_w, filter_n]
    return tf.transpose(tf.reshape(z, [out_h, out_w, n, filter_n]), [2, 0, 1, 3])
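The output spatial size follows the standard convolution arithmetic. With the settings we will use below (28×28 input, 5×5 filter, padding 2, stride 1), the spatial size is preserved:

$$ out\_h = \frac{h + 2 \cdot padding - filter\_h}{stride} + 1 = \frac{28 + 2 \cdot 2 - 5}{1} + 1 = 28 $$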

To compute the convolution efficiently, we use a simple trick called flattening (commonly known as im2col). After flattening, the input is transformed into a 2D matrix in which each row is one receptive-field window, so the whole convolution becomes a single matrix multiplication with the filter (which is also reshaped into a 2D matrix).


In [10]:
def flatten(X, window_h, window_w, window_c, out_h, out_w, stride=1, padding=0):
    
    X_padded = tf.pad(X, [[0,0], [padding, padding], [padding, padding], [0,0]])

    windows = []
    for y in range(out_h):
        for x in range(out_w):
            window = tf.slice(X_padded, [0, y*stride, x*stride, 0], [-1, window_h, window_w, -1])
            windows.append(window)
    stacked = tf.stack(windows) # shape : [out_h, out_w, n, filter_h, filter_w, c]

    return tf.reshape(stacked, [-1, window_c*window_w*window_h])
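As a quick sanity check (a throwaway snippet, not part of the model), you could flatten X with the same window settings the first convolution will use and confirm the resulting 2D shape: 28·28 window positions per image × 100 images = 78,400 rows, and 5·5·1 = 25 columns.

# Sanity check only: im2col view of X for a 5x5 window, padding 2, stride 1
X_flat_check = flatten(X, window_h=5, window_w=5, window_c=1, out_h=28, out_w=28, stride=1, padding=2)
print(X_flat_check.get_shape())  # (78400, 25)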

In [15]:
print(X.shape, X.dtype, W1.shape, W1.dtype, b1.shape, b1.dtype)


(100, 28, 28, 1) <dtype: 'float32'> (5, 5, 1, 30) <dtype: 'float32_ref'> (30,) <dtype: 'float32_ref'>

In [17]:
conv_layer = convolution(X, W1, b1, padding=2, stride=1)
conv_layer


Out[17]:
<tf.Tensor 'transpose_1:0' shape=(100, 28, 28, 30) dtype=float32>

ReLU


In [18]:
def relu(X):
    return tf.maximum(X, tf.zeros_like(X))

In [20]:
conv_activation_layer = relu(conv_layer)
conv_activation_layer


Out[20]:
<tf.Tensor 'Maximum_1:0' shape=(100, 28, 28, 30) dtype=float32>

Max pooling


In [21]:
def max_pool(X, pool_h, pool_w, padding, stride):
    n, h, w, c = [d.value for d in X.get_shape()]
    
    out_h = (h + 2*padding - pool_h)//stride + 1
    out_w = (w + 2*padding - pool_w)//stride + 1

    X_flat = flatten(X, pool_h, pool_w, c, out_h, out_w, stride, padding)

    pool = tf.reduce_max(tf.reshape(X_flat, [out_h, out_w, n, pool_h*pool_w, c]), axis=3)
    return tf.transpose(pool, [2, 0, 1, 3])
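The same output-size formula as in the convolution applies here. With a 2×2 window, no padding, and stride 2, each spatial dimension is halved:

$$ out\_h = \frac{28 + 2 \cdot 0 - 2}{2} + 1 = 14 $$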

In [22]:
pooling_layer = max_pool(conv_activation_layer, pool_h=2, pool_w=2, padding=0, stride=2)
pooling_layer


Out[22]:
<tf.Tensor 'transpose_2:0' shape=(100, 14, 14, 30) dtype=float32>

Affine layer 1


In [23]:
batch_size, pool_output_h, pool_output_w, filter_n = [d.value for d in pooling_layer.get_shape()]

In [24]:
# number of nodes in the hidden layer
hidden_size = 100

In [25]:
W2 = tf.Variable(tf.random_normal([pool_output_h*pool_output_w*filter_n, hidden_size], stddev=0.01))
b2 = tf.Variable(tf.zeros([hidden_size]))

In [26]:
def affine(X, W, b):
    n = X.get_shape()[0].value # number of samples
    X_flat = tf.reshape(X, [n, -1])
    return tf.matmul(X_flat, W) + b

In [27]:
affine_layer1 = affine(pooling_layer, W2, b2)
affine_layer1


Out[27]:
<tf.Tensor 'add_2:0' shape=(100, 100) dtype=float32>

In [28]:
init = tf.global_variables_initializer()
init.run()
affine_layer1.eval({X:example_X, t:example_ys})[0]


Out[28]:
array([ 0.00925174, -0.00247716,  0.00373842, -0.01411869,  0.01808271,
       -0.00904754, -0.00638073, -0.00591116, -0.01707015, -0.0120058 ,
       -0.03524879, -0.00075297, -0.02764303, -0.00427013, -0.00041813,
       -0.014628  , -0.01604748,  0.01305443,  0.00531883, -0.0068157 ,
        0.01700079, -0.00695998,  0.01047445,  0.00686595,  0.00277898,
       -0.01327773,  0.02326751,  0.00105084, -0.00554805,  0.00553582,
        0.00587317, -0.03639751, -0.01127398, -0.01640638, -0.00076795,
       -0.01045864,  0.04147588, -0.01156281,  0.02095137, -0.00906414,
       -0.00811423,  0.00924253, -0.01614905, -0.00712552, -0.01189603,
       -0.00500245,  0.02677673, -0.03063465,  0.01832008, -0.02034824,
        0.00077786,  0.00186712,  0.02998282,  0.02252693, -0.02051033,
        0.00244616, -0.00651262, -0.00623711,  0.01035882, -0.00499087,
        0.01756883, -0.00748296,  0.01257539, -0.02312755,  0.01175804,
       -0.02441126,  0.0081053 , -0.04168748,  0.01337018, -0.01475471,
        0.01009009,  0.00202592,  0.01153922, -0.0029256 , -0.00095263,
        0.01502997,  0.00875417,  0.02194441, -0.0025549 , -0.01146321,
       -0.01528439,  0.03253143,  0.00252969,  0.01415024, -0.01252698,
       -0.04096382, -0.0087318 ,  0.04963036,  0.000627  , -0.01366708,
       -0.00637358, -0.02686524, -0.00803551, -0.0043704 ,  0.00203722,
       -0.02695317, -0.02998732, -0.00332387,  0.01216408,  0.01784112],
      dtype=float32)

The above result is the representation of the first example as a 100-dimensional vector in the hidden layer.


In [29]:
affine_activation_layer1 = relu(affine_layer1)
affine_activation_layer1


Out[29]:
<tf.Tensor 'Maximum_2:0' shape=(100, 100) dtype=float32>

In [30]:
affine_activation_layer1.eval({X:example_X, t:example_ys})[0]


Out[30]:
array([0.00925174, 0.        , 0.00373842, 0.        , 0.01808271,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.01305443, 0.00531883, 0.        ,
       0.01700079, 0.        , 0.01047445, 0.00686595, 0.00277898,
       0.        , 0.02326751, 0.00105084, 0.        , 0.00553582,
       0.00587317, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.04147588, 0.        , 0.02095137, 0.        ,
       0.        , 0.00924253, 0.        , 0.        , 0.        ,
       0.        , 0.02677673, 0.        , 0.01832008, 0.        ,
       0.00077786, 0.00186712, 0.02998282, 0.02252693, 0.        ,
       0.00244616, 0.        , 0.        , 0.01035882, 0.        ,
       0.01756883, 0.        , 0.01257539, 0.        , 0.01175804,
       0.        , 0.0081053 , 0.        , 0.01337018, 0.        ,
       0.01009009, 0.00202592, 0.01153922, 0.        , 0.        ,
       0.01502997, 0.00875417, 0.02194441, 0.        , 0.        ,
       0.        , 0.03253143, 0.00252969, 0.01415024, 0.        ,
       0.        , 0.        , 0.04963036, 0.000627  , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00203722,
       0.        , 0.        , 0.        , 0.01216408, 0.01784112],
      dtype=float32)

This is after applying ReLU to the above representation. You can see that we set all the negative numbers to 0.

Affine layer 2


In [31]:
output_size = 10

In [33]:
W3 = tf.Variable(tf.random_normal([hidden_size, output_size], stddev=0.01))
b3 = tf.Variable(tf.zeros([output_size]))
W3, b3


Out[33]:
(<tf.Variable 'Variable_6:0' shape=(100, 10) dtype=float32_ref>,
 <tf.Variable 'Variable_7:0' shape=(10,) dtype=float32_ref>)

In [34]:
affine_layer2 = affine(affine_activation_layer1, W3, b3)

In [35]:
# because you have new variables, you need to initialize them.
init = tf.global_variables_initializer()
init.run()

In [36]:
affine_layer2.eval({X:example_X, t:example_ys})[0]


Out[36]:
array([ 0.00182686,  0.00224067,  0.00030621, -0.00095036,  0.00034657,
       -0.00138684,  0.00118214,  0.00020053, -0.00027152,  0.00456122],
      dtype=float32)

Softmax


In [37]:
def softmax(X):
    X_centered = X - tf.reduce_max(X)  # subtract the (global) max to avoid overflow; it cancels out below
    X_exp = tf.exp(X_centered)
    exp_sum = tf.reduce_sum(X_exp, axis=1)
    return tf.transpose(tf.transpose(X_exp) / exp_sum)  # divide each row by its sum (transpose trick for broadcasting)
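Written out, for each row of logits this computes

$$ y_k = \frac{\exp(x_k - m)}{\sum_{j} \exp(x_j - m)} $$

where m is a constant offset. Subtracting m does not change the result (it cancels in the ratio), but it keeps every exponent non-positive so tf.exp cannot overflow. Note that the code uses the maximum over the whole batch, which works just as well as a per-row maximum for this purpose.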

In [39]:
softmax_layer = softmax(affine_layer2)
softmax_layer


Out[39]:
<tf.Tensor 'transpose_6:0' shape=(100, 10) dtype=float32>

In [40]:
softmax_layer.eval({X:example_X, t:example_ys})[0]


Out[40]:
array([0.10010205, 0.10014348, 0.09994995, 0.09982443, 0.09995398,
       0.09978087, 0.10003753, 0.09993938, 0.09989222, 0.10037614],
      dtype=float32)

We get roughly evenly distributed probabilities over the 10 digits. This is expected, because we haven't trained our model yet.

Cross entropy error


In [41]:
def cross_entropy_error(y, t):
    return -tf.reduce_mean(tf.log(tf.reduce_sum(y * t, axis=1)))
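Because t is one-hot, the inner sum simply picks out the probability the network assigns to the correct class of each example, so this is the average negative log-likelihood over the batch:

$$ E = -\frac{1}{N} \sum_{n=1}^{N} \log\Big(\sum_{k=1}^{10} t_{nk}\, y_{nk}\Big) = -\frac{1}{N} \sum_{n=1}^{N} \log y_{n,\,\mathrm{label}(n)} $$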

In [42]:
loss = cross_entropy_error(softmax_layer, t)

In [43]:
loss.eval({X:example_X, t:example_ys})


Out[43]:
2.3026032
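This is exactly what we should expect from an untrained network: each of the 10 classes gets probability of roughly 1/10, so the cross-entropy is about

$$ -\log\frac{1}{10} = \log 10 \approx 2.3026 $$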

In [44]:
learning_rate = 0.1
trainer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

In [45]:
# number of times to iterate over training data
training_epochs = 2

In [46]:
# number of batches
num_batch = int(mnist.train.num_examples/batch_size)
num_batch


Out[46]:
550


In [63]:
# Quick test: train for 500 iterations in a fresh session and watch the loss decrease
sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(500):
    X_mb, y_mb = mnist.train.next_batch(batch_size)

    _, loss_val = sess.run([trainer, loss], feed_dict={X: X_mb, t: y_mb})
    avg_cost = loss_val / num_batch  # single-batch loss scaled by 1/num_batch

    # Print the scaled loss every 100 iterations
    if i % 100 == 0:
        print(avg_cost)


0.004186583432284268
0.00411091457713734
0.0010877627676183527
0.0008315690539099954
0.0009691767259077592

In [53]:
# Another quick test: 50 more iterations using the default InteractiveSession
for i in range(50):
    train_X, train_ys = mnist.train.next_batch(batch_size)
    trainer.run(feed_dict={X: train_X, t: train_ys})
    avg_cost = loss.eval(feed_dict={X: train_X, t: train_ys}) / num_batch  # scaled single-batch loss
    print(avg_cost)

0.0002530513026497581
0.0005116122961044312
0.00015374977480281482
0.00024025369774211537
0.0002687355875968933
0.00021916113116524435
0.0004313530434261669
0.00013656723228367893
0.00027604070576754485
0.0002509538422931324
0.00026693642139434813
0.0004813069647008722
0.00024193882942199707
0.00022291856733235447
0.00028267844156785446
0.0002987148545005105
0.00020740299062295392
0.00031692970882762564
0.0002276279709555886
0.00022814975543455644
0.00013362857428464022
0.00020789036696607416
0.00018464168364351445
0.00041824812238866634
0.0003933531045913696
0.0002731123566627502
0.00023209579966285013
0.0002516916394233704
0.00020734738219868053
0.0002037431841546839
0.0002385480837388472
0.0001330316202207045
0.00036589893427762117
0.00029510717500339856
0.00017985174601728267
0.00030295583334836096
0.0002563689784570174
0.00021367035128853538
0.00028936746445569126
0.00030755351890217173
0.00022703165357763116
0.00020914259282025423
0.0001804633167656985
0.00016635522246360779
0.00019020209258252923
0.00011095401915636929
0.00024563537402586504
0.000314336744221774
0.00040946258739991623
0.00014553071423010393

In [38]:
from tqdm import tqdm_notebook

In [39]:
for epoch in range(training_epochs):
    avg_cost = 0
    for _ in tqdm_notebook(range(num_batch)):
        train_X, train_ys = mnist.train.next_batch(batch_size)
        trainer.run(feed_dict={X:train_X, t:train_ys})
        avg_cost += loss.eval(feed_dict={X:train_X, t:train_ys}) / num_batch

    print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost), flush=True)


Epoch: 0001 cost= 0.801890662
Epoch: 0002 cost= 0.109185281

In [41]:
test_x = mnist.test.images[:batch_size]
test_t = mnist.test.labels[:batch_size]

In [46]:
def accuracy(network, t):
    
    t_predict = tf.argmax(network, axis=1)
    t_actual = tf.argmax(t, axis=1)

    return tf.reduce_mean(tf.cast(tf.equal(t_predict, t_actual), tf.float32))

In [47]:
accuracy(softmax_layer, t).eval(feed_dict={X:test_x, t:test_t})


Out[47]:
0.98000002

We got an accuracy of 98%. Awesome!


In [48]:
session.close()
