In this section, we will use the famous MNIST dataset to build two neural networks capable of classifying handwritten digits. The first network is a simple Multi-layer Perceptron (MLP) and the second is a Convolutional Neural Network (CNN from now on). In other words, our algorithm will say, with some associated error, which digit the presented input represents.

This lesson is not intended to be a reference for *machine learning, convolutions or TensorFlow*. The intention is to introduce these fields and to make the user aware of Data Scientist Workbench capabilities. We recommend that students consult further references to fully understand the mathematical and theoretical concepts involved.

- What is Deep Learning
- Simple test: Is tensorflow working?
- 1st part: classify MNIST using a simple model
- Evaluating the final result
- How to improve our model?
- 2nd part: Deep Learning applied on MNIST
- Summary of the Deep Convolutional Neural Network
- Define functions and train the model
- Evaluate the model

**Brief Theory:** Deep learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.

```
In [1]:
```import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

```
```

Using a different notation, the same digits can be represented as one-hot vectors: each label becomes a vector of ten entries that are all 0 except for a single 1 at the index of the digit.
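As a minimal sketch of the one-hot notation, the snippet below maps a digit to a ten-element vector. The helper name `one_hot` is hypothetical and is not part of TensorFlow or of the MNIST loader used in this notebook.

```python
def one_hot(digit, num_classes=10):
    """Return a list with 1.0 at index `digit` and 0.0 everywhere else."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be between 0 and num_classes - 1")
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

# The label "nine" has its single 1.0 in the last position.
nine = one_hot(9)
print(nine)
```

This is the same layout that `one_hot=True` requests when the dataset is loaded above.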

The imported data can be divided as follows:

- Training (mnist.train) >> Use the given dataset, with inputs and related outputs, to train the NN. In our case, if you give an image that you know represents a "nine", this set tells the neural network that we expect a "nine" as the output.
    - 55,000 data points
    - mnist.train.images for inputs
    - mnist.train.labels for outputs

- Validation (mnist.validation) >> The same as training, but now the data is used to estimate model properties (classification error, for example) and, from these, to tune parameters such as the number of hidden units or to determine a stopping point for the back-propagation algorithm.
    - 5,000 data points
    - mnist.validation.images for inputs
    - mnist.validation.labels for outputs

- Test (mnist.test) >> The model does not have access to this information prior to the test phase. It is used to evaluate the performance and accuracy of the model against "real life situations". No further optimization is done beyond this point.
    - 10,000 data points
    - mnist.test.images for inputs
    - mnist.test.labels for outputs

You have two basic options when using TensorFlow to run your code:

- [Build graphs and run session] Do all the set-up and THEN execute a session to evaluate tensors and run operations (ops).
- [Interactive session] Write your code and run it on the fly.

For this first part, we will use the interactive session, which is more suitable for environments like Jupyter notebooks.

```
In [2]:
```sess = tf.InteractiveSession()

It's a best practice to create placeholders before variable assignments when using TensorFlow. Here we'll create placeholders for inputs ("Xs") and outputs ("Ys").

**Placeholder 'X':** represents the "space" allocated for the input, that is, the images.

```
* Each input has 784 pixels, arranged as a 28 (width) x 28 (height) matrix.
* The 'shape' argument defines the tensor size by its dimensions.
* 1st dimension = None. Indicates that the batch can be of any size.
* 2nd dimension = 784. Indicates the number of pixels in a single flattened MNIST image.
```

**Placeholder 'Y':** represents the final output or the labels.

```
* 10 possible classes (0,1,2,3,4,5,6,7,8,9)
* The 'shape' argument defines the tensor size by its dimensions.
* 1st dimension = None. Indicates that the batch can be of any size.
* 2nd dimension = 10. Indicates the number of targets/outcomes.
```

**dtype for both placeholders:** if you are not sure, use tf.float32. The limitation here is that the softmax function presented later only accepts float32 or float64 dtypes. For more dtypes, check TensorFlow's documentation.
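To make the two placeholder dimensions concrete, here is a small dependency-free sketch. It assumes hypothetical all-black images and a hypothetical `flatten` helper; it only illustrates how a 28x28 image becomes a row of 784 values and how any number of such rows fits the `None` dimension.

```python
WIDTH, HEIGHT = 28, 28

def flatten(image):
    """Flatten a HEIGHT x WIDTH nested list row by row into a single list."""
    return [pixel for row in image for pixel in row]

# Hypothetical all-black image, used only to show the shapes.
image = [[0.0] * WIDTH for _ in range(HEIGHT)]

# A batch is a list of flattened images; its length is free (the `None` dimension).
batch = [flatten(image) for _ in range(50)]
print(len(batch), len(batch[0]))
```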

```
In [3]:
```x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

```
In [4]:
```# Weight tensor
W = tf.Variable(tf.zeros([784,10],tf.float32))
# Bias tensor
b = tf.Variable(tf.zeros([10],tf.float32))

Please notice that we're using this notation "sess.run" because we previously started an interactive session.

```
In [5]:
```# run the op initialize_all_variables using an interactive session
sess.run(tf.initialize_all_variables())

Illustration showing how weights and biases are added to neurons/nodes.

```
In [6]:
```#mathematical operation to add weights and biases to the inputs
tf.matmul(x,W) + b

```
Out[6]:
```

Softmax is an activation function that is normally used in classification problems. It generates the probabilities for the output. For example, our model will not be 100% sure that a digit is the number nine; instead, the answer will be a distribution of probabilities where, if the model is right, the number nine has the largest probability.

For comparison, below is the one-hot vector for a nine digit label:

0 --> 0
1 --> 0
2 --> 0
3 --> 0
4 --> 0
5 --> 0
6 --> 0
7 --> 0
8 --> 0
9 --> 1

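A hypothetical model output for the same digit (illustrative confidence scores; as listed, they do not sum to 100%):

0 --> 0.1%
1 --> 2%
2 --> 3%
3 --> 2%
4 --> 12%
5 --> 10%
6 --> 57%
7 --> 20%
8 --> 55%
9 --> 80%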
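As a rough sketch of how softmax turns raw scores into a proper probability distribution, the pure-Python function below exponentiates and normalizes a list of scores. The scores are made up for illustration, and the max-subtraction step is a common numerical-stability trick, not something taken from this notebook.

```python
import math

def softmax(logits):
    """Turn raw scores into positive probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the ten digit classes; class 9 scores highest.
scores = [0.1, 0.2, 0.3, 0.2, 1.2, 1.0, 2.5, 1.5, 2.4, 3.0]
probs = softmax(scores)
print(probs.index(max(probs)))  # index of the most likely class
```

Unlike the raw scores, the resulting probabilities always add up to 1, which is what lets us compare them against a one-hot label.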

```
In [7]:
```y = tf.nn.softmax(tf.matmul(x,W) + b)

```
In [8]:
```cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

```
In [9]:
```train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

Train using minibatch Gradient Descent.

In practice, Batch Gradient Descent is not often used because it is too computationally expensive. The advantage of this method is that you get the true gradient, but at the cost of processing the whole dataset at once. For this reason, neural networks are usually trained with minibatches.
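The idea of splitting the data into minibatches can be sketched as follows. This is an illustrative generator over a hypothetical toy dataset; it is not how `mnist.train.next_batch` is implemented.

```python
import random

def minibatches(inputs, labels, batch_size, seed=0):
    """Yield (inputs, labels) chunks that together cover the shuffled data once."""
    order = list(range(len(inputs)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield [inputs[i] for i in chunk], [labels[i] for i in chunk]

# Hypothetical toy dataset of 120 examples.
xs = list(range(120))
ys = [value % 10 for value in xs]
sizes = [len(batch_x) for batch_x, _ in minibatches(xs, ys, batch_size=50)]
print(sizes)
```

Each gradient step then only needs the examples in one chunk instead of the whole dataset.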

```
In [10]:
```batch = mnist.train.next_batch(50)

```
In [11]:
```batch[0].shape

```
Out[11]:
```

```
In [12]:
```type(batch[0])

```
Out[12]:
```

```
In [13]:
```mnist.train.images.shape

```
Out[13]:
```

```
In [14]:
```#Load 50 training examples for each training iteration
for i in range(1000):
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

```
In [15]:
```correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
acc = accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}) * 100
print("The final accuracy for the simple ANN model is: {} % ".format(acc) )

```
```

```
In [16]:
```sess.close() #finish the session

Is the final result good?

Let's check the best algorithm available out there (as of 10 June 2016):

*Result:* 0.21% error (99.79% accuracy)

Reference here

How to improve our model? Several options follow:

- Regularization of Neural Networks using DropConnect
- Multi-column Deep Neural Networks for Image Classification
- APAC: Augmented Pattern Classification with Neural Networks
- Simple Deep Neural Network with Dropout

In the next part we explore this option:

- Simple Deep Neural Network with Dropout (more than 1 hidden layer)

In the first part, we learned how to use a simple ANN to classify MNIST. Now we are going to expand our knowledge using a Deep Neural Network.

The architecture of our network is:

- (Input) -> [batch_size, 28, 28, 1] >> Apply 32 filters of [5x5]
- (Convolutional layer 1) -> [batch_size, 28, 28, 32]
- (ReLU 1) -> [?, 28, 28, 32]
- (Max pooling 1) -> [?, 14, 14, 32]
- (Convolutional layer 2) -> [?, 14, 14, 64]
- (ReLU 2) -> [?, 14, 14, 64]
- (Max pooling 2) -> [?, 7, 7, 64]
- [fully connected layer 3] -> [1x1024]
- [ReLU 3] -> [1x1024]
- [Drop out] -> [1x1024]
- [fully connected layer 4] -> [1x10]

The next cells will explore this new architecture.
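The shapes in the list above can be checked with a short calculation. The sketch below assumes `'SAME'` padding (as used in the cells that follow), stride 1 for the convolutions and stride 2 for the 2x2 pooling; the function names are hypothetical.

```python
import math

def same_out(size, stride):
    """Spatial output size of a 'SAME'-padded convolution or pooling op."""
    return math.ceil(size / stride)

def trace_shapes(height=28, width=28):
    """Follow one image through conv/pool layers 1 and 2, then flatten it."""
    shapes = [("input", (height, width, 1))]
    for index, channels in enumerate((32, 64), start=1):
        # Convolution with stride 1 keeps the spatial size and sets the depth.
        height, width = same_out(height, 1), same_out(width, 1)
        shapes.append(("conv%d" % index, (height, width, channels)))
        # 2x2 max pooling with stride 2 halves the spatial size.
        height, width = same_out(height, 2), same_out(width, 2)
        shapes.append(("pool%d" % index, (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    return shapes

for name, shape in trace_shapes():
    print(name, shape)
```

The flattened size is the input width of the first fully connected layer.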

```
In [58]:
```import tensorflow as tf
# finish possible remaining session
sess.close()
#Start interactive session
sess = tf.InteractiveSession()

```
In [59]:
```from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

```
```

Create general parameters for the model

```
In [60]:
```width = 28 # width of the image in pixels
height = 28 # height of the image in pixels
flat = width * height # number of pixels in one image
class_output = 10 # number of possible classifications for the problem

Create placeholders for inputs and outputs

```
In [61]:
```x = tf.placeholder(tf.float32, shape=[None, flat])
y_ = tf.placeholder(tf.float32, shape=[None, class_output])

The input image is 28 pixels by 28 pixels with 1 channel (grayscale).

In this case the first dimension is the **batch number** of the image (the position of the input in the batch) and can be of any size (due to the -1).
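How a `-1` entry gets resolved during a reshape can be sketched in plain Python: the missing dimension is whatever makes the total element count match. The helper `infer_shape` is hypothetical and only mimics the arithmetic, not TensorFlow's implementation.

```python
def infer_shape(total_elements, shape):
    """Replace the single -1 in `shape` so the element count matches."""
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if shape.count(-1) != 1 or total_elements % known != 0:
        raise ValueError("cannot infer the missing dimension")
    return [total_elements // known if dim == -1 else dim for dim in shape]

# A batch of 50 flattened images holds 50 * 784 values.
print(infer_shape(50 * 784, [-1, 28, 28, 1]))
```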

```
In [62]:
```x_image = tf.reshape(x, [-1,28,28,1])

```
In [63]:
``````
x_image
```

```
Out[63]:
```

Size of the filter/kernel: 5x5;

Input channels: 1 (grayscale);

32 feature maps (here, 32 feature maps means that 32 different filters are applied to each image, so the output of the convolution layer would be 28x28x32). In this step, we create a filter / kernel tensor of shape `[filter_height, filter_width, in_channels, out_channels]`

```
In [64]:
```W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32])) # need 32 biases for 32 outputs

Defining a function to create convolutional layers. To create a convolutional layer, we use **tf.nn.conv2d**. It computes a 2-D convolution given 4-D input and filter tensors.

Inputs:

- tensor of shape [batch, in_height, in_width, in_channels]. x of shape [batch_size,28 ,28, 1]
- a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]. W is of size [5, 5, 1, 32]
- stride which is [1, 1, 1, 1]

Process:

- Changes the filter to a 2-D matrix with shape `[5*5*1, 32]`
- Extracts image patches from the input tensor to form a *virtual* tensor of shape `[batch, 28, 28, 5*5*1]`
- For each patch, right-multiplies the filter matrix and the image patch vector.

Output:

- A `Tensor` (a 2-D convolution) of shape `(?, 28, 28, 32)`
- Notice: the output of the first convolution layer is 32 [28x28] images. Here 32 is considered as the volume/depth of the output image.
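The three process steps can be imitated in plain Python for a single one-channel image. This is a slow, illustrative sketch under stated assumptions (stride 1, zero padding, odd kernel size); the function `conv2d_same` is hypothetical and is not how `tf.nn.conv2d` is implemented internally.

```python
def conv2d_same(image, kernels):
    """Convolve an H x W nested list with several k x k kernels.

    Returns an H x W x len(kernels) nested list, using zero padding and stride 1.
    """
    h, w = len(image), len(image[0])
    pad = len(kernels[0]) // 2
    # Step 1: flatten each kernel, giving one column of the "filter matrix" each.
    flat_kernels = [[v for row in kernel for v in row] for kernel in kernels]
    output = []
    for r in range(h):
        out_row = []
        for c in range(w):
            # Step 2: extract the patch centred on (r, c); outside pixels read as 0.
            patch = [
                image[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0.0
                for dr in range(-pad, pad + 1)
                for dc in range(-pad, pad + 1)
            ]
            # Step 3: one dot product per kernel produces the output depth.
            out_row.append([sum(p * q for p, q in zip(patch, fk)) for fk in flat_kernels])
        output.append(out_row)
    return output

# Hypothetical 4x4 image of ones and two 3x3 kernels: identity and all-ones.
image = [[1.0] * 4 for _ in range(4)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
result = conv2d_same(image, [identity, ones])
print(len(result), len(result[0]), len(result[0][0]))
```

The spatial size is preserved while the depth equals the number of kernels, which mirrors the 28x28x32 output described above.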

```
In [65]:
```convolve1= tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1

In this step, we go through all the outputs of the convolution layer, **convolve1**, and wherever a negative number occurs, we swap it out for a 0. This is called the ReLU activation function.
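In plain Python, the same element-wise rule looks like this (a minimal sketch on a made-up list of values):

```python
def relu(values):
    """Replace every negative entry with 0.0 and keep the rest unchanged."""
    return [max(0.0, v) for v in values]

print(relu([-2.0, -0.5, 0.0, 1.5]))
```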

```
In [66]:
```h_conv1 = tf.nn.relu(convolve1)

Use the max pooling operation, so the output would be 14x14x32.

Defining a function to perform max pooling. Max pooling is an operation that finds maximum values and simplifies the inputs using the spatial correlations between them.

**Kernel size:** 2x2 (if the window is a 2x2 matrix, it results in one output pixel)

**Strides:** dictates the sliding behaviour of the kernel. In this case it moves 2 pixels every time, thus not overlapping.
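A 2x2 window with stride 2 can be sketched on a single feature map as follows. The helper `max_pool_2x2` is hypothetical and assumes the height and width are even, so no padding is needed.

```python
def max_pool_2x2(feature_map):
    """Downsample an H x W nested list (H, W even) with a 2x2 window, stride 2."""
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, len(feature_map[0]), 2)
        ]
        for r in range(0, len(feature_map), 2)
    ]

# Hypothetical 4x4 feature map; each non-overlapping 2x2 block yields one value.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each spatial dimension is halved, which is why 28x28 becomes 14x14 after the first pooling layer.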

```
In [67]:
```h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') #max_pool_2x2

```
In [68]:
```layer1= h_pool1

Filter/kernel: 5x5 (25 pixels) ; Input channels: 32 (from the 1st Conv layer, we had 32 feature maps); 64 output feature maps

**Notice:** here, the input is 14x14x32, the filter is 5x5x32, we use 64 filters, and the output of the convolutional layer would be 14x14x64.

**Notice:** the result of applying a single filter of size [5x5x32] to an image of size [14x14x32] is an image of size [14x14x1]; that is, the convolution operates on the whole volume. Applying 64 such filters gives the 14x14x64 output.

```
In [69]:
```W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[64])) #need 64 biases for 64 outputs

```
In [70]:
```convolve2= tf.nn.conv2d(layer1, W_conv2, strides=[1, 1, 1, 1], padding='SAME')+ b_conv2

```
In [71]:
```h_conv2 = tf.nn.relu(convolve2)

```
In [72]:
```h_pool2 = tf.nn.max_pool(h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') #max_pool_2x2

```
In [73]:
```layer2= h_pool2

So, what is the output of the second layer, layer2?

- It is 64 matrices of [7x7]

Type: Fully Connected Layer. You need a fully connected layer to use the Softmax and create the probabilities in the end. Fully connected layers take the high-level filtered images from the previous layer, that is, all 64 matrices, and convert them to an array.

So, each [7x7] matrix will be converted to a matrix of [49x1], and then all 64 matrices will be concatenated, which makes an array of size [3136x1]. We will connect it to another layer of size [1024x1]. So, the weight matrix between these 2 layers will be [3136x1024].
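The sizes above also tell us how many trainable parameters each layer has. The tally below is simple arithmetic on the weight and bias shapes used in this notebook (including the final 1024-to-10 layer from the architecture list); it shows that the first fully connected layer dominates.

```python
# (name, number of weights, number of biases), from the shapes in this notebook.
layers = [
    ("conv1", 5 * 5 * 1 * 32, 32),
    ("conv2", 5 * 5 * 32 * 64, 64),
    ("fc1", 7 * 7 * 64 * 1024, 1024),
    ("fc2", 1024 * 10, 10),
]

total = 0
for name, weights, biases in layers:
    total += weights + biases
    print(name, weights + biases)
print("total", total)
```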

```
In [74]:
```layer2_matrix = tf.reshape(layer2, [-1, 7*7*64])

```
In [75]:
```W_fc1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[1024])) # need 1024 biases for 1024 outputs

```
In [76]:
```fcl3=tf.matmul(layer2_matrix, W_fc1) + b_fc1

```
In [77]:
```h_fc1 = tf.nn.relu(fcl3)

```
In [78]:
```layer3= h_fc1

```
In [79]:
``````
layer3
```

```
Out[79]:
```

```
In [80]:
```keep_prob = tf.placeholder(tf.float32)
layer3_drop = tf.nn.dropout(layer3, keep_prob)

Type: Softmax, Fully Connected Layer.

```
In [81]:
```W_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1)) #1024 neurons
b_fc2 = tf.Variable(tf.constant(0.1, shape=[10])) # 10 possibilities for digits [0,1,2,3,4,5,6,7,8,9]

```
In [82]:
```fcl4=tf.matmul(layer3_drop, W_fc2) + b_fc2

```
In [83]:
```y_conv= tf.nn.softmax(fcl4)

```
In [84]:
```layer4= y_conv

```
In [85]:
``````
layer4
```

```
Out[85]:
```

Now it is time to recall the structure of our network.

We need to compare our output, the layer4 tensor, with the ground truth for the whole mini-batch. We can use **cross entropy** to see how badly our CNN is performing, that is, to measure the error at a softmax layer.

The following code shows a toy sample of cross-entropy for a mini-batch of size 2 whose items have been classified. You can run it (first change the cell type to **code** in the toolbar) to see how cross entropy changes.

import numpy as np
layer4_test =[[0.9, 0.1, 0.1],[0.9, 0.1, 0.1]]
y_test=[[1.0, 0.0, 0.0],[1.0, 0.0, 0.0]]
np.mean( -np.sum(y_test * np.log(layer4_test),1))

**reduce_sum** computes the sum of the elements of **(y_ * tf.log(layer4))** across the second dimension of the tensor, and **reduce_mean** computes the mean of all elements in the resulting tensor.
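The same two reductions can be written without any library, which makes it easy to see how the loss reacts to the model's confidence. This is a sketch with made-up predictions; skipping zero-label terms is an assumption added here to avoid evaluating `log(0)`.

```python
import math

def cross_entropy(labels, predictions):
    """Mean over the batch of -sum(label * log(prediction)) per example."""
    per_example = [
        -sum(t * math.log(p) for t, p in zip(label, pred) if t > 0)
        for label, pred in zip(labels, predictions)
    ]
    return sum(per_example) / len(per_example)

# Hypothetical mini-batch of two items, both truly belonging to class 0.
labels = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
confident = cross_entropy(labels, [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]])
unsure = cross_entropy(labels, [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3]])
print(confident < unsure)
```

The loss shrinks as the probability assigned to the correct class grows, which is the behaviour the optimizer exploits.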

```
In [86]:
```cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(layer4), reduction_indices=[1]))

We want to minimize the error of our network, which is measured by the cross_entropy metric. To do so, we have to compute the gradients of the loss (the cross-entropy) and apply them to the variables. This is done by an optimizer such as GradientDescent or Adagrad; the next cell uses Adam.
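The "compute gradients, then apply them" loop can be illustrated on a one-variable toy problem. Note that this is plain gradient descent on a made-up quadratic loss, not the Adam update used in the notebook, and the function name is hypothetical.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly move the variable against the gradient of the loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w = gradient_descent(lambda w: 2 * (w - 3), start=0.0, learning_rate=0.1, steps=100)
print(round(w, 6))
```

An optimizer in TensorFlow performs the analogous update for every weight and bias tensor at once.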

```
In [87]:
```train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

```
In [88]:
```correct_prediction = tf.equal(tf.argmax(layer4,1), tf.argmax(y_,1))

```
In [89]:
```accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

```
In [90]:
```sess.run(tf.initialize_all_variables())

*If you want a fast result (it might still take some time to train):*

```
In [91]:
```for i in range(1100):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        #train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        loss, train_accuracy = sess.run([cross_entropy, accuracy], feed_dict={x: batch[0],y_: batch[1],keep_prob: 1.0})
        print("step %d, loss %g, training accuracy %g"%(i, float(loss),float(train_accuracy)))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

```
```

If you have time to wait, you can train for more iterations instead (first change the cell type to **code**):

    for i in range(20000):
        batch = mnist.train.next_batch(50)
        if i%100 == 0:
            train_accuracy = accuracy.eval(feed_dict={
                x:batch[0], y_: batch[1], keep_prob: 1.0})
            print("step %d, training accuracy %g"%(i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

Print the evaluation to the user

```
In [92]:
```print("test accuracy %g"%accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

```
```

Do you want to look at all the filters?

```
In [93]:
```kernels = sess.run(tf.reshape(tf.transpose(W_conv1, perm=[2, 3, 0,1]),[32,-1]))

```
In [94]:
```from utils import tile_raster_images
import matplotlib.pyplot as plt
from PIL import Image
%matplotlib inline
image = Image.fromarray(tile_raster_images(kernels, img_shape=(5, 5) ,tile_shape=(4, 8), tile_spacing=(1, 1)))
### Plot image
plt.rcParams['figure.figsize'] = (18.0, 18.0)
imgplot = plt.imshow(image)
imgplot.set_cmap('gray')

```
```

Do you want to see the output of an image passing through first convolution layer?

```
In [95]:
```import numpy as np
plt.rcParams['figure.figsize'] = (5.0, 5.0)
sampleimage = mnist.test.images[1]
plt.imshow(np.reshape(sampleimage,[28,28]), cmap="gray")

```
Out[95]:
```

```
In [100]:
```ActivatedUnitsL1.shape  # ActivatedUnitsL1 is defined in cell In [103]; run that cell first

```
Out[100]:
```

```
In [103]:
```ActivatedUnitsL1 = sess.run(convolve1,feed_dict={x:np.reshape(sampleimage,[1,784],order='F'),keep_prob:1.0})
filters = ActivatedUnitsL1.shape[3]
plt.figure(1, figsize=(20,20))
n_columns = 6
n_rows = int(np.ceil(filters / n_columns)) + 1  # np.math is not available in recent NumPy
for i in range(filters):
    plt.subplot(n_rows, n_columns, i+1)
    plt.title('Conv1_ ' + str(i))
    plt.imshow(ActivatedUnitsL1[0,:,:,i], interpolation="nearest", cmap="gray")

```
```

What about second convolution layer?

```
In [105]:
```ActivatedUnitsL2 = sess.run(convolve2,feed_dict={x:np.reshape(sampleimage,[1,784],order='F'),keep_prob:1.0})
filters = ActivatedUnitsL2.shape[3]
plt.figure(1, figsize=(20,20))
n_columns = 8
n_rows = int(np.ceil(filters / n_columns)) + 1  # np.math is not available in recent NumPy
for i in range(filters):
    plt.subplot(n_rows, n_columns, i+1)
    plt.title('Conv_2 ' + str(i))
    plt.imshow(ActivatedUnitsL2[0,:,:,i], interpolation="nearest", cmap="gray")

```
```

```
In [57]:
```sess.close() #finish the session

Saeed Aghabozorgi, PhD is a Sr. Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.

https://en.wikipedia.org/wiki/Deep_learning

http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent

http://yann.lecun.com/exdb/mnist/

https://www.quora.com/Artificial-Neural-Networks-What-is-the-difference-between-activation-functions

https://www.tensorflow.org/versions/r0.9/tutorials/mnist/pros/index.html