In this section, we will use the famous MNIST dataset to build two neural networks capable of performing handwritten digit classification. The first network is a simple Multi-Layer Perceptron (MLP) and the second one is a Convolutional Neural Network (CNN from now on). In other words, given an input image, our algorithm will say, with some associated error, which digit it shows.
This lesson is not intended to be a reference for machine learning, convolutions, or TensorFlow. The intention is to give the user basic notions of these fields and an awareness of Data Scientist Workbench capabilities. We recommend that students look up further references to fully understand the mathematical and theoretical concepts involved.
Brief Theory: Deep learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
We are going to create a simple Multi-Layer Perceptron, a simple type of neural network, to perform classification tasks on the MNIST digits dataset. If you are not familiar with the MNIST dataset, please consider reading more about it: click here
According to Lecun's website, the MNIST is a: "database of handwritten digits that has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image".
It's very important to notice that MNIST is a highly optimized dataset and does not contain image files; the digits are stored as flattened arrays of pixel values, so you will need to write your own code if you want to visualize the actual digits. Another important side note is the effort the authors invested in this dataset with the normalization and centering operations.
In [1]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
The one_hot=True argument means that, in contrast to a binary representation, the labels will be presented in a way where only one bit is on for a specific digit. For example, five and zero in a binary code would be:
Number representation: 0
Binary encoding: [2^5] [2^4] [2^3] [2^2] [2^1] [2^0]
Array/vector:      0     0     0     0     0     0

Number representation: 5
Binary encoding: [2^5] [2^4] [2^3] [2^2] [2^1] [2^0]
Array/vector:      0     0     0     1     0     1
Using a different notation, the same digits in a one-hot vector representation can be shown as:
Number representation: 0
One-hot encoding: [5] [4] [3] [2] [1] [0]
Array/vector:      0   0   0   0   0   1

Number representation: 5
One-hot encoding: [5] [4] [3] [2] [1] [0]
Array/vector:      1   0   0   0   0   0
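As a minimal sketch (using plain NumPy rather than the dataset loader, and with our own hypothetical helper name one_hot), a 10-element one-hot vector like the ones returned by the loader above, indexed left to right from digit 0, can be built like this:

import numpy as np

def one_hot(digit, num_classes=10):
    # Return a vector of zeros with a single 1 at position `digit`
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[digit] = 1.0
    return vec

print(one_hot(0))  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(5))  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]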
The imported data can be divided as follows (a quick shape check appears after the list):
- 55,000 training data points
  - mnist.train.images for inputs
  - mnist.train.labels for outputs
- 5,000 validation data points
  - mnist.validation.images for inputs
  - mnist.validation.labels for outputs
- 10,000 test data points
  - mnist.test.images for inputs
  - mnist.test.labels for outputs
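As a quick sanity check of the split sizes (this assumes the mnist object loaded above), you can print the array shapes:

# Shapes of the three splits loaded above
print(mnist.train.images.shape, mnist.train.labels.shape)            # (55000, 784) (55000, 10)
print(mnist.validation.images.shape, mnist.validation.labels.shape)  # (5000, 784) (5000, 10)
print(mnist.test.images.shape, mnist.test.labels.shape)              # (10000, 784) (10000, 10)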
You have two basic options when using TensorFlow to run your code: you can build the full computation graph and then execute it inside a regular tf.Session, or you can create an interactive session that lets you interleave building and running operations.
For this first part, we will use the interactive session, which is more suitable for environments like Jupyter notebooks.
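For reference, a minimal sketch of the first option (build the graph, then run it in a regular session) might look like the code below; the constants are purely illustrative:

# Option 1 (sketch): build the graph first, then execute it inside a regular Session
a = tf.constant(2.0)
b = tf.constant(3.0)
c = a + b
with tf.Session() as s:
    print(s.run(c))  # 5.0

With an InteractiveSession (option 2, used next), you can instead call .run() or .eval() on operations as you go, which is convenient in a notebook.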
In [2]:
sess = tf.InteractiveSession()
It's a best practice to create placeholders before variable assignments when using TensorFlow. Here we'll create placeholders for inputs ("Xs") and outputs ("Ys").
Placeholder 'X': represents the "space" allocated for the input, that is, the images.
* Each input has 784 pixels distributed in a 28 (width) x 28 (height) matrix
* The 'shape' argument defines the tensor size by its dimensions.
* 1st dimension = None. Indicates that the batch size can be of any size.
* 2nd dimension = 784. Indicates the number of pixels on a single flattened MNIST image.
Placeholder 'Y': represents the final output, that is, the labels.
* 10 possible classes (0,1,2,3,4,5,6,7,8,9)
* The 'shape' argument defines the tensor size by its dimensions.
* 1st dimension = None. Indicates that the batch size can be of any size.
* 2nd dimension = 10. Indicates the number of targets/outcomes
dtype for both placeholders: if you are not sure, use tf.float32. The limitation here is that the softmax function presented later only accepts float32 or float64 dtypes. For more dtypes, check TensorFlow's documentation here
In [3]:
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
Now we are going to create the weights and biases; for now, they will be initialized as arrays filled with zeros. The values chosen here can be critical, but we'll cover a better way of initializing them in the second part.
In [4]:
# Weight tensor
W = tf.Variable(tf.zeros([784,10],tf.float32))
# Bias tensor
b = tf.Variable(tf.zeros([10],tf.float32))
Above, we defined the weight and bias variables, but they have not yet been initialized inside the session. TensorFlow needs to initialize the variables you have assigned before they can be used.
Please notice that we're using the sess.run notation because we previously started an interactive session.
In [5]:
# run the op initialize_all_variables using an interactive session
sess.run(tf.initialize_all_variables())
The next operation implements in code the usual mathematical convention for a layer: the tf.matmul operation performs a matrix multiplication between x (inputs) and W (weights), and then the bias b is added.
In [6]:
#mathematical operation to add weights and biases to the inputs
tf.matmul(x,W) + b
Out[6]:
Softmax is an activation function that is normally used in classification problems. It generates the probabilities for the output. For example, our model will not be 100% sure that a given digit is the number nine; instead, the answer will be a distribution of probabilities where, if the model is right, the digit nine will have the largest probability.
For comparison, below is the one-hot vector for a nine digit label:
A machine does not have that certainty, so we want to know its best guess, but we also want to understand how confident it was and what its second-best option was. Below is an example of a hypothetical distribution for a nine:
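As an illustration only (these scores are made up, not produced by the model), such a hypothetical distribution can be generated by pushing an arbitrary score vector through softmax:

import numpy as np

# Made-up scores (logits) for digits 0..9, chosen so that 9 wins and 8 is the runner-up
scores = np.array([0.1, 0.2, 0.1, 0.2, 1.0, 0.1, 0.1, 0.3, 1.5, 4.0])
probs = np.exp(scores) / np.sum(np.exp(scores))
for digit, p in enumerate(probs):
    print("digit %d: %.3f" % (digit, p))
# Digit 9 gets the largest probability (~0.78); digit 8 is the second guess.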
In [7]:
y = tf.nn.softmax(tf.matmul(x,W) + b)
The logistic function is used for classification between two target classes (0/1). The softmax function is a generalization of the logistic function: it can output a categorical probability distribution over multiple classes.
The cost function measures the difference between the correct answers (labels) and the outputs estimated by our network; training will minimize it. Here we use cross-entropy.
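For reference, the quantity computed by the cell below is the average cross-entropy over the mini-batch, where $y$ is the one-hot label and $\hat{y}$ is the softmax output:

$$ H(y, \hat{y}) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=0}^{9} y_{n,i}\,\log \hat{y}_{n,i} $$

The inner sum over the 10 classes corresponds to reduce_sum with reduction_indices=[1], and the average over the $N$ examples corresponds to reduce_mean.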
In [8]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
This is the part where you configure the optimizer for your neural network. There are several optimizers available; in our case we will use Gradient Descent, which is very well established.
In [9]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
Train using minibatch Gradient Descent.
In practice, Batch Gradient Descent is not often used because it is too computationally expensive. The good part about this method is that you get the true gradient, at the cost of the expensive computation of using the whole dataset at once. Because of this, neural networks usually train on minibatches.
In [10]:
batch = mnist.train.next_batch(50)
In [11]:
batch[0].shape
Out[11]:
In [12]:
type(batch[0])
Out[12]:
In [13]:
mnist.train.images.shape
Out[13]:
In [14]:
#Load 50 training examples for each training iteration
for i in range(1000):
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
In [15]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
acc = accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}) * 100
print("The final accuracy for the simple ANN model is: {} % ".format(acc) )
In [16]:
sess.close() #finish the session
Is the final result good?
Let's compare it with the best algorithm available out there (as of 10th June 2016):
Result: 0.21% error (99.79% accuracy)
Reference here
In the first part, we learned how to use a simple ANN to classify MNIST. Now we are going to expand our knowledge using a Deep Neural Network.
The architecture of our network is:
- (Input) 28x28x1 grayscale image
- (Convolutional layer 1) 5x5 kernel, 32 feature maps, ReLU
- (Max pooling 1) 2x2 window, output 14x14x32
- (Convolutional layer 2) 5x5 kernel, 64 feature maps, ReLU
- (Max pooling 2) 2x2 window, output 7x7x64
- (Fully connected layer) 1024 neurons, ReLU, followed by dropout
- (Readout layer) fully connected, softmax over the 10 digit classes
The next cells will explore this new architecture.
In [58]:
import tensorflow as tf
# finish possible remaining session
sess.close()
#Start interactive session
sess = tf.InteractiveSession()
In [59]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
Create general parameters for the model
In [60]:
width = 28 # width of the image in pixels
height = 28 # height of the image in pixels
flat = width * height # number of pixels in one image
class_output = 10 # number of possible classifications for the problem
Create placeholders for inputs and outputs
In [61]:
x = tf.placeholder(tf.float32, shape=[None, flat])
y_ = tf.placeholder(tf.float32, shape=[None, class_output])
The input image is 28 pixels by 28 pixels with 1 channel (grayscale).
In this case the first dimension is the batch index of the image (its position within the batch) and can be of any size (indicated by -1).
In [62]:
x_image = tf.reshape(x, [-1,28,28,1])
In [63]:
x_image
Out[63]:
Size of the filter/kernel: 5x5;
Input channels: 1 (greyscale);
32 feature maps (here, 32 feature maps means 32 different filters are applied on each image. So, the output of convolution layer would be 28x28x32). In this step, we create a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]
In [64]:
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32])) # need 32 biases for 32 outputs
To create the convolutional layer, we use tf.nn.conv2d. It computes a 2-D convolution given 4-D input and filter tensors.
Inputs:
- the input tensor x_image, of shape [batch, 28, 28, 1]
- the filter/kernel tensor W_conv1, of shape [filter_height, filter_width, in_channels, out_channels] = [5, 5, 1, 32]
- strides=[1, 1, 1, 1], so the kernel window slides one pixel at a time
- padding='SAME', so the output keeps the same height and width as the input
Process:
- the filter is flattened to a 2-D matrix of shape [5*5*1, 32]
- image patches are extracted from the input to form a virtual tensor of shape [batch, 28, 28, 5*5*1]
- for each patch, the image patch vector is right-multiplied by the filter matrix
Output:
A tensor (the 2-D convolution) of shape [batch, 28, 28, 32]. Notice: the output of the first convolution layer is 32 images of size 28x28; here 32 is considered the volume/depth of the output.
In [65]:
convolve1= tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1
In this step, we go through all outputs of the convolution layer, convolve1, and wherever a negative number occurs, we swap it out for a 0. This is called the ReLU activation function.
In [66]:
h_conv1 = tf.nn.relu(convolve1)
Apply the max pooling operation, so the output will be 14x14x32.
We use tf.nn.max_pool to perform max pooling, an operation that finds maximum values and simplifies the input using the spatial correlations between pixels.
Kernel size: 2x2 (each 2x2 window produces one output pixel).
Strides: dictate the sliding behaviour of the kernel. In this case the window moves 2 pixels every time, so the windows do not overlap.
In [67]:
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') #max_pool_2x2
In [68]:
layer1= h_pool1
Filter/kernel: 5x5 (25 pixels) ; Input channels: 32 (from the 1st Conv layer, we had 32 feature maps); 64 output feature maps
Notice: here, the input is 14x14x32, the filter is 5x5x32, we use 64 filters, and the output of the convolutional layer would be 14x14x64.
Notice: the result of applying one filter of size [5x5x32] to an input of size [14x14x32] is an image of size [14x14x1]; that is, the convolution operates on the full input volume.
In [69]:
W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[64])) #need 64 biases for 64 outputs
In [70]:
convolve2= tf.nn.conv2d(layer1, W_conv2, strides=[1, 1, 1, 1], padding='SAME')+ b_conv2
In [71]:
h_conv2 = tf.nn.relu(convolve2)
In [72]:
h_pool2 = tf.nn.max_pool(h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') #max_pool_2x2
In [73]:
layer2= h_pool2
So, what is the output of the second layer, layer2?
Type: Fully Connected Layer. You need a fully connected layer to use Softmax and produce the probabilities at the end. Fully connected layers take the high-level filtered images from the previous layer, that is, all 64 matrices, and convert them to a flat array.
So, each [7x7] matrix will be converted to a [49x1] vector, and then all 64 of them will be concatenated, which makes an array of size [3136x1]. We will connect it to another layer of size [1024x1], so the weight matrix between these two layers will be [3136x1024].
In [74]:
layer2_matrix = tf.reshape(layer2, [-1, 7*7*64])
The number of inputs is the size of a feature map from the last layer (7x7) multiplied by the number of feature maps (64), i.e., 3136; there are 1024 outputs that feed the Softmax (readout) layer.
In [75]:
W_fc1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[1024])) # need 1024 biases for 1024 outputs
In [76]:
fcl3=tf.matmul(layer2_matrix, W_fc1) + b_fc1
In [77]:
h_fc1 = tf.nn.relu(fcl3)
In [78]:
layer3= h_fc1
In [79]:
layer3
Out[79]:
Dropout is a phase where the network "forgets" some features. At each training step in a mini-batch, some units get switched off randomly so that they do not interact with the network: their weights cannot be updated, nor do they affect the learning of the other network nodes. This can be very useful for very large neural networks to prevent overfitting.
In [80]:
keep_prob = tf.placeholder(tf.float32)
layer3_drop = tf.nn.dropout(layer3, keep_prob)
Type: Softmax, Fully Connected Layer.
In the last layer, the CNN takes the high-level filtered features and translates them into votes (one per class) using softmax. Input channels: 1024 (neurons from the 3rd layer); 10 output features.
In [81]:
W_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1)) #1024 neurons
b_fc2 = tf.Variable(tf.constant(0.1, shape=[10])) # 10 possibilities for digits [0,1,2,3,4,5,6,7,8,9]
In [82]:
fcl4=tf.matmul(layer3_drop, W_fc2) + b_fc2
In [83]:
y_conv= tf.nn.softmax(fcl4)
In [84]:
layer4= y_conv
In [85]:
layer4
Out[85]:
Now it is time to recall the structure of our network.
We need to compare our output, the layer4 tensor, with the ground truth for the whole mini-batch. We can use cross-entropy to see how badly our CNN is performing, that is, to measure the error at the softmax layer.
In the cell further below, reduce_sum computes the sum of the elements of (y_ * tf.log(layer4)) across the second dimension of the tensor, and reduce_mean computes the mean of all elements in the resulting tensor.
First, here is a toy example of cross-entropy for a mini-batch of size 2 whose items have already been classified; you can run it to see how the cross-entropy changes.
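This is a minimal sketch of such a toy example (the labels and predictions are made up, and plain NumPy is used instead of TensorFlow):

import numpy as np

# Toy mini-batch of size 2: one-hot ground truth and two hypothetical softmax outputs
y_true = np.array([[0., 0., 1.],      # sample 1 belongs to class 2
                   [1., 0., 0.]])     # sample 2 belongs to class 0
y_pred = np.array([[0.1, 0.2, 0.7],   # correct and fairly confident
                   [0.6, 0.3, 0.1]])  # correct but less confident

# Same computation as the TensorFlow cell below: sum over classes, then mean over the batch
per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
print(per_sample)         # ~[0.357 0.511]
print(per_sample.mean())  # ~0.434

Try making the predictions more (or less) confident and the mean cross-entropy will decrease (or increase).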
In [86]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(layer4), reduction_indices=[1]))
It is obvious that we want to minimize the error of our network, which is measured by the cross_entropy metric. To solve the problem, we have to compute the gradients of the loss and apply them to the variables. This is done by an optimizer such as GradientDescent, Adagrad, or, as used below, Adam.
In [87]:
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
In [88]:
correct_prediction = tf.equal(tf.argmax(layer4,1), tf.argmax(y_,1))
In [89]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
In [90]:
sess.run(tf.initialize_all_variables())
To get a fast result, the cell below runs only 1,100 training iterations (training the network fully would take some time).
In [91]:
for i in range(1100):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        #train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        loss, train_accuracy = sess.run([cross_entropy, accuracy], feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, loss %g, training accuracy %g"%(i, float(loss), float(train_accuracy)))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
PS: If you have problems running this notebook, please shut down all your running Jupyter notebooks, clear all cell outputs, and run each cell only after the previous cell has completed.
Print the evaluation to the user
In [92]:
print("test accuracy %g"%accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
Do you want to look at all the filters?
In [93]:
kernels = sess.run(tf.reshape(tf.transpose(W_conv1, perm=[2, 3, 0,1]),[32,-1]))
In [94]:
from utils import tile_raster_images
import matplotlib.pyplot as plt
from PIL import Image
%matplotlib inline
image = Image.fromarray(tile_raster_images(kernels, img_shape=(5, 5) ,tile_shape=(4, 8), tile_spacing=(1, 1)))
### Plot image
plt.rcParams['figure.figsize'] = (18.0, 18.0)
imgplot = plt.imshow(image)
imgplot.set_cmap('gray')
Do you want to see the output of an image passing through the first convolution layer?
In [95]:
import numpy as np
plt.rcParams['figure.figsize'] = (5.0, 5.0)
sampleimage = mnist.test.images[1]
plt.imshow(np.reshape(sampleimage,[28,28]), cmap="gray")
Out[95]:
In [100]:
ActivatedUnitsL1 = sess.run(convolve1, feed_dict={x: np.reshape(sampleimage, [1, 784], order='F'), keep_prob: 1.0})
ActivatedUnitsL1.shape
Out[100]:
In [103]:
filters = ActivatedUnitsL1.shape[3]
plt.figure(1, figsize=(20,20))
n_columns = 6
n_rows = np.math.ceil(filters / n_columns) + 1
for i in range(filters):
    plt.subplot(n_rows, n_columns, i+1)
    plt.title('Conv1_' + str(i))
    plt.imshow(ActivatedUnitsL1[0,:,:,i], interpolation="nearest", cmap="gray")
What about second convolution layer?
In [105]:
ActivatedUnitsL2 = sess.run(convolve2,feed_dict={x:np.reshape(sampleimage,[1,784],order='F'),keep_prob:1.0})
filters = ActivatedUnitsL2.shape[3]
plt.figure(1, figsize=(20,20))
n_columns = 8
n_rows = np.math.ceil(filters / n_columns) + 1
for i in range(filters):
    plt.subplot(n_rows, n_columns, i+1)
    plt.title('Conv_2 ' + str(i))
    plt.imshow(ActivatedUnitsL2[0,:,:,i], interpolation="nearest", cmap="gray")
In [57]:
sess.close() #finish the session
Saeed Aghabozorgi, PhD, is a Senior Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
https://en.wikipedia.org/wiki/Deep_learning
http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent
http://yann.lecun.com/exdb/mnist/
https://www.quora.com/Artificial-Neural-Networks-What-is-the-difference-between-activation-functions
https://www.tensorflow.org/versions/r0.9/tutorials/mnist/pros/index.html