This tutorial is a slightly modified version of the original tutorial by Hvass Labs. Image data in deep learning is laid out differently from many other data sets: each image is stored as a one-dimensional array of pixel values, and the length of that array equals the image height multiplied by the image width (the square of the image size when both dimensions are equal).

The purpose of this simple linear model example is to classify images into 10 classes: the digits 0 to 9. From the input variables and the variables to be optimized (weights and biases), the model is built as a linear expression that produces the logits. `logits` is a matrix with `num_images` rows and `num_classes` columns. Then the optimization happens. The optimizer minimizes the average cross-entropy (the cost), because the cross-entropy is 0 when the predictions perfectly match the true labels. Different parameters might be chosen to get a better optimization result. In this exercise, I tried different values for the optimizer's learning rate and for the batch size used in each iteration.

This tutorial demonstrates the basic workflow of using TensorFlow with a simple linear model. After loading the so-called MNIST data-set with images of hand-written digits, we define and optimize a simple mathematical model in TensorFlow. The results are then plotted and discussed.

You should be familiar with basic linear algebra, Python and the Jupyter Notebook editor. It also helps if you have a basic understanding of Machine Learning and classification.

```python
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix
```

This was developed using Python 3.5.2 (Anaconda) and TensorFlow version:

```python
tf.__version__
```

```python
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets("data/MNIST/", one_hot=True)
```

```python
print("Size of:")
print("- Training-set:\t\t{}".format(len(data.train.labels)))
print("- Test-set:\t\t{}".format(len(data.test.labels)))
print("- Validation-set:\t{}".format(len(data.validation.labels)))
```

```python
data.test.labels[0:5, :]
```

```python
data.test.cls = np.array([label.argmax() for label in data.test.labels])
data.train.cls = np.array([label.argmax() for label in data.train.labels])
```

```python
data.test.cls[0:5]
```

```python
# We know that MNIST images are 28 pixels in each dimension.
img_size = 28
# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size
# Tuple with height and width of images used to reshape arrays.
img_shape = (img_size, img_size)
# Number of classes, one class for each of 10 digits.
num_classes = 10
```

```python
def plot_images(images, cls_true, cls_pred=None):
    assert len(images) == len(cls_true) == 9
    # Create figure with 3x3 sub-plots.
    fig, axes = plt.subplots(3, 3)
    fig.subplots_adjust(hspace=0.3, wspace=0.3)
    for i, ax in enumerate(axes.flat):
        # Plot image.
        ax.imshow(images[i].reshape(img_shape), cmap='binary')
        # Show true and predicted classes.
        if cls_pred is None:
            xlabel = "True: {0}".format(cls_true[i])
        else:
            xlabel = "True: {0}, Pred: {1}".format(cls_true[i], cls_pred[i])
        ax.set_xlabel(xlabel)
        # Remove ticks from the plot.
        ax.set_xticks([])
        ax.set_yticks([])
```

Here I changed the plot from the test-set to the training-set to see how the function works.

```python
# Get the first images from the training-set
# (the original tutorial used the test-set here).
#images = data.test.images[0:9]
images = data.train.images[0:9]
# Get the true classes for those images.
#cls_true = data.test.cls[0:9]
cls_true = data.train.cls[0:9]
# Plot the images and labels using our helper-function above.
plot_images(images=images, cls_true=cls_true)
```

The entire purpose of TensorFlow is to have a so-called computational graph that can be executed much more efficiently than if the same calculations were to be performed directly in Python. TensorFlow can be more efficient than NumPy because TensorFlow knows the entire computation graph that must be executed, while NumPy only knows the computation of a single mathematical operation at a time.

TensorFlow can also automatically calculate the gradients that are needed to optimize the variables of the graph so as to make the model perform better. This is because the graph is a combination of simple mathematical expressions so the gradient of the entire graph can be calculated using the chain-rule for derivatives.
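The chain-rule idea can be illustrated without TensorFlow. The following is only a minimal sketch in plain Python: the toy function, the names `loss`, `grad_chain_rule` and `grad_numeric`, and all numbers are assumptions made up for this illustration. It assembles the gradient of a composed expression from the derivatives of its simple steps and compares it with a finite-difference estimate.

```python
# Toy graph: loss(w) = (w * x + b - y) ** 2, built from three simple steps.
# All names and values are illustrative assumptions, not part of the tutorial.
x, y, b = 2.0, 7.0, 1.0


def loss(w):
    pred = w * x + b   # step 1: linear model
    err = pred - y     # step 2: error
    return err ** 2    # step 3: squared error


def grad_chain_rule(w):
    err = (w * x + b) - y
    dloss_derr = 2.0 * err   # derivative of step 3 with respect to err
    derr_dpred = 1.0         # derivative of step 2 with respect to pred
    dpred_dw = x             # derivative of step 1 with respect to w
    # Chain rule: multiply the derivatives of the individual steps.
    return dloss_derr * derr_dpred * dpred_dw


def grad_numeric(w, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)


w = 0.5
print(grad_chain_rule(w), grad_numeric(w))
```

Both values should agree closely. TensorFlow applies the same principle automatically to every operation in the graph.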

TensorFlow can also take advantage of multi-core CPUs as well as GPUs - and Google has even built special chips just for TensorFlow which are called TPUs (Tensor Processing Units) and are even faster than GPUs.

A TensorFlow graph consists of the following parts which will be detailed below:

- Placeholder variables used to change the input to the graph.
- Model variables that are going to be optimized so as to make the model perform better.
- The model which is essentially just a mathematical function that calculates some output given the input in the placeholder variables and the model variables.
- A cost measure that can be used to guide the optimization of the variables.
- An optimization method which updates the variables of the model.

In addition, the TensorFlow graph may also contain various debugging statements e.g. for logging data to be displayed using TensorBoard, which is not covered in this tutorial.

Placeholder variables serve as the input to the graph that we may change each time we execute the graph. We call this feeding the placeholder variables and it is demonstrated further below.

First we define the placeholder variable for the input images. This allows us to change the images that are input to the TensorFlow graph. This is a so-called tensor, which just means that it is a multi-dimensional vector or matrix. The data-type is set to `float32` and the shape is set to `[None, img_size_flat]`, where `None` means that the tensor may hold an arbitrary number of images with each image being a vector of length `img_size_flat`.

```python
x = tf.placeholder(tf.float32, [None, img_size_flat])
```

Next we have the placeholder variable for the true labels associated with the images that are input in the placeholder variable `x`. The shape of this placeholder variable is `[None, num_classes]`, which means it may hold an arbitrary number of labels and each label is a vector of length `num_classes`, which is 10 in this case.

```python
y_true = tf.placeholder(tf.float32, [None, num_classes])
```

Finally we have the placeholder variable for the true class of each image in the placeholder variable `x`. These are integers and the dimensionality of this placeholder variable is set to `[None]`, which means the placeholder variable is a one-dimensional vector of arbitrary length.

```python
y_true_cls = tf.placeholder(tf.int64, [None])
```

Apart from the placeholder variables that were defined above, which serve to feed input data into the model, there are also some model variables that must be changed by TensorFlow so as to make the model perform better on the training data.

The first variable that must be optimized is called `weights` and is defined here as a TensorFlow variable that must be initialized with zeros and whose shape is `[img_size_flat, num_classes]`, so it is a 2-dimensional tensor (or matrix) with `img_size_flat` rows and `num_classes` columns.

```python
weights = tf.Variable(tf.zeros([img_size_flat, num_classes]))
```

The second variable that must be optimized is called `biases` and is defined as a 1-dimensional tensor (or vector) of length `num_classes`.

```python
biases = tf.Variable(tf.zeros([num_classes]))
```

This simple mathematical model multiplies the images in the placeholder variable `x` with the `weights` and then adds the `biases`.

The result is a matrix of shape `[num_images, num_classes]` because `x` has shape `[num_images, img_size_flat]` and `weights` has shape `[img_size_flat, num_classes]`, so the multiplication of those two matrices is a matrix with shape `[num_images, num_classes]` and then the `biases` vector is added to each row of that matrix.

Note that the name `logits` is typical TensorFlow terminology, but other people may call the variable something else.

```python
logits = tf.matmul(x, weights) + biases
```
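To make the shapes concrete, here is a rough plain-Python sketch of the same computation on tiny stand-in data. The sizes (2 images, 3 pixels, 4 classes) and all values are assumptions chosen for readability; MNIST uses 784 pixels and 10 classes, and TensorFlow performs this as one vectorized operation.

```python
# Tiny stand-in sizes (assumptions for illustration only).
num_images, img_size_flat, num_classes = 2, 3, 4

# Shape [num_images, img_size_flat]
x = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]

# Shape [img_size_flat, num_classes]
weights = [[0.1, 0.0, 0.0, 1.0],
           [0.0, 0.2, 0.0, 1.0],
           [0.0, 0.0, 0.3, 1.0]]

# Shape [num_classes]
biases = [10.0, 20.0, 30.0, 40.0]

# logits[i][j] = sum over k of x[i][k] * weights[k][j], plus biases[j].
# The same biases vector is added to every row.
logits = [[sum(x[i][k] * weights[k][j] for k in range(img_size_flat)) + biases[j]
           for j in range(num_classes)]
          for i in range(num_images)]

# The result has num_images rows and num_classes columns.
print(len(logits), len(logits[0]))
```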

Now `logits` is a matrix with `num_images` rows and `num_classes` columns, where the element of the $i$'th row and $j$'th column is an estimate of how likely the $i$'th input image is to be of the $j$'th class.

However, these estimates are a bit rough and difficult to interpret because the numbers may be very small or large, so we want to normalize them so that each row of the `logits` matrix sums to one, and each element is limited between zero and one. This is calculated using the so-called softmax function and the result is stored in `y_pred`.

```python
y_pred = tf.nn.softmax(logits)
```
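For intuition, the softmax normalization of a single row can be sketched in plain Python as follows. This is an illustrative sketch, not TensorFlow's implementation; the helper name `softmax` and the input values are assumptions.

```python
import math


def softmax(row):
    # Subtracting the maximum does not change the result,
    # but it avoids overflow when the logits are large.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


logits_row = [2.0, 1.0, 0.1]   # made-up logits for one image
probs = softmax(logits_row)

# Each element lies between zero and one, and the row sums to one.
print(probs, sum(probs))
```

The largest logit still gets the largest probability, so the ranking of the classes is unchanged by the normalization.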

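The predicted class can be calculated from the `y_pred` matrix by taking the index of the largest element in each row.

```python
# `axis` replaces the deprecated `dimension` argument (TensorFlow 1.0 and later).
y_pred_cls = tf.argmax(y_pred, axis=1)
```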
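The row-wise "index of the largest element" can be sketched in plain Python like this. The helper `argmax` and the probability values are assumptions for illustration only.

```python
def argmax(row):
    # Index of the largest element in the row.
    return max(range(len(row)), key=lambda j: row[j])


# Made-up softmax outputs, one row per image.
y_pred = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2]]

y_pred_cls = [argmax(row) for row in y_pred]
print(y_pred_cls)  # [1, 0]
```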

To make the model better at classifying the input images, we must somehow change the variables for `weights` and `biases`. To do this we first need to know how well the model currently performs by comparing the predicted output of the model `y_pred` to the desired output `y_true`.

The cross-entropy is a performance measure used in classification. The cross-entropy is a continuous function that is never negative, and if the predicted output of the model exactly matches the desired output then the cross-entropy equals zero. The goal of optimization is therefore to minimize the cross-entropy so it gets as close to zero as possible by changing the `weights` and `biases` of the model.

TensorFlow has a built-in function for calculating the cross-entropy. Note that it uses the values of the `logits` because it also calculates the softmax internally.

```python
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits,
                                                        labels=y_true)
```

```python
cost = tf.reduce_mean(cross_entropy)
```
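As a rough illustration of how the cross-entropy behaves, the sketch below computes it from logits and a one-hot label in plain Python and then averages over three made-up examples. The helper name `softmax_cross_entropy` and all values are assumptions; this is not TensorFlow's implementation.

```python
import math


def softmax_cross_entropy(logits_row, one_hot_label):
    # log(softmax) computed in a numerically stable way.
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits_row))
    log_probs = [v - log_sum for v in logits_row]
    # Only the entry of the true class contributes for a one-hot label.
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))


label = [0.0, 0.0, 1.0]   # true class is 2

confident_right = softmax_cross_entropy([0.0, 0.0, 20.0], label)
untrained = softmax_cross_entropy([0.0, 0.0, 0.0], label)
confident_wrong = softmax_cross_entropy([20.0, 0.0, 0.0], label)

# The cost is the average cross-entropy over the examples.
cost = (confident_right + untrained + confident_wrong) / 3.0
print(confident_right, untrained, confident_wrong, cost)
```

A confident correct prediction gives a value close to zero, all-zero logits give the logarithm of the number of classes, and a confident wrong prediction gives a large value.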

Now that we have a cost measure that must be minimized, we can create an optimizer. In this case it is the basic form of Gradient Descent. The original tutorial sets the step-size to 0.5.

Note that optimization is not performed at this point. In fact, nothing is calculated at all; we just add the optimizer-object to the TensorFlow graph for later execution.

Here, I changed the step-size from 0.5 to 0.4.

```python
#optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.4).minimize(cost)
```
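The effect of the step-size can be seen on a toy problem. The sketch below applies the basic gradient-descent update in plain Python to a made-up one-variable cost; the function names and the cost are assumptions, and the behaviour on this toy cost says nothing about which step-size works best for the MNIST model.

```python
def gradient_descent(grad, w0, learning_rate, num_steps):
    # Basic update rule: move against the gradient, scaled by the step-size.
    w = w0
    for _ in range(num_steps):
        w = w - learning_rate * grad(w)
    return w


def toy_grad(w):
    # Gradient of the toy cost (w - 3) ** 2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


w_04 = gradient_descent(toy_grad, w0=0.0, learning_rate=0.4, num_steps=10)
w_05 = gradient_descent(toy_grad, w0=0.0, learning_rate=0.5, num_steps=10)
w_11 = gradient_descent(toy_grad, w0=0.0, learning_rate=1.1, num_steps=10)
print(w_04, w_05, w_11)
```

With step-sizes 0.4 and 0.5 the value approaches 3, while the step-size 1.1 overshoots further on every step and diverges.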

We need a few more performance measures to display the progress to the user.

This is a vector of booleans indicating whether the predicted class equals the true class of each image.

```python
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
```

```python
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```
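The same two steps can be sketched in plain Python on a handful of made-up class numbers (the values are assumptions for illustration): compare element-wise, then cast the booleans to floats and average them.

```python
y_pred_cls = [7, 2, 1, 0, 4]   # made-up predicted classes
y_true_cls = [7, 2, 1, 0, 9]   # made-up true classes

# Vector of booleans: True where the prediction is correct.
correct_prediction = [p == t for p, t in zip(y_pred_cls, y_true_cls)]

# Cast False/True to 0.0/1.0 and take the mean.
accuracy = sum(float(c) for c in correct_prediction) / len(correct_prediction)
print("{0:.1%}".format(accuracy))  # 80.0%
```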

```python
session = tf.Session()
```

```python
session.run(tf.global_variables_initializer())
```

Here, I changed `batch_size` from 100 to 200.

```python
#batch_size = 100
batch_size = 200
```

Function for performing a number of optimization iterations so as to gradually improve the `weights` and `biases` of the model. In each iteration, a new batch of data is selected from the training-set and then TensorFlow executes the optimizer using those training samples.

```python
def optimize(num_iterations):
    for i in range(num_iterations):
        # Get a batch of training examples.
        # x_batch now holds a batch of images and
        # y_true_batch are the true labels for those images.
        x_batch, y_true_batch = data.train.next_batch(batch_size)
        # Put the batch into a dict with the proper names
        # for placeholder variables in the TensorFlow graph.
        # Note that the placeholder for y_true_cls is not set
        # because it is not used during training.
        feed_dict_train = {x: x_batch,
                           y_true: y_true_batch}
        # Run the optimizer using this batch of training data.
        # TensorFlow assigns the variables in feed_dict_train
        # to the placeholder variables and then runs the optimizer.
        session.run(optimizer, feed_dict=feed_dict_train)
```
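To show what "selecting a new batch in each iteration" can look like, here is a minimal plain-Python sketch of a batch source. The class `BatchSource` is hypothetical and written only for this illustration; it is not the implementation behind `data.train.next_batch`, which may differ in its details.

```python
import random


class BatchSource:
    """Hypothetical stand-in for a data-set object that serves mini-batches."""

    def __init__(self, examples, seed=0):
        self._examples = list(examples)
        self._rng = random.Random(seed)
        self._pos = 0
        self._rng.shuffle(self._examples)

    def next_batch(self, batch_size):
        # Reshuffle and start over when too few unused examples remain.
        if self._pos + batch_size > len(self._examples):
            self._rng.shuffle(self._examples)
            self._pos = 0
        batch = self._examples[self._pos:self._pos + batch_size]
        self._pos += batch_size
        return batch


# Ten made-up examples, served in batches of four.
source = BatchSource(range(10))
batches = [source.next_batch(4) for _ in range(3)]
print(batches)
```

Each call returns a different slice of the shuffled data, so every optimization iteration sees a different small sample of the training-set.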

```python
feed_dict_test = {x: data.test.images,
                  y_true: data.test.labels,
                  y_true_cls: data.test.cls}
```

Function for printing the classification accuracy on the test-set.

```python
def print_accuracy():
    # Use TensorFlow to compute the accuracy.
    acc = session.run(accuracy, feed_dict=feed_dict_test)
    # Print the accuracy.
    print("Accuracy on test-set: {0:.1%}".format(acc))
```

Function for printing and plotting the confusion matrix using scikit-learn.

```python
def print_confusion_matrix():
    # Get the true classifications for the test-set.
    cls_true = data.test.cls
    # Get the predicted classifications for the test-set.
    cls_pred = session.run(y_pred_cls, feed_dict=feed_dict_test)
    # Get the confusion matrix using sklearn.
    cm = confusion_matrix(y_true=cls_true,
                          y_pred=cls_pred)
    # Print the confusion matrix as text.
    print(cm)
    # Plot the confusion matrix as an image.
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    # Make various adjustments to the plot.
    plt.tight_layout()
    plt.colorbar()
    tick_marks = np.arange(num_classes)
    plt.xticks(tick_marks, range(num_classes))
    plt.yticks(tick_marks, range(num_classes))
    plt.xlabel('Predicted')
    plt.ylabel('True')
```

Function for plotting examples of images from the test-set that have been mis-classified.

```python
def plot_example_errors():
    # Use TensorFlow to get a list of boolean values
    # whether each test-image has been correctly classified,
    # and a list for the predicted class of each image.
    correct, cls_pred = session.run([correct_prediction, y_pred_cls],
                                    feed_dict=feed_dict_test)
    # Negate the boolean array.
    incorrect = (correct == False)
    # Get the images from the test-set that have been
    # incorrectly classified.
    images = data.test.images[incorrect]
    # Get the predicted classes for those images.
    cls_pred = cls_pred[incorrect]
    # Get the true classes for those images.
    cls_true = data.test.cls[incorrect]
    # Plot the first 9 images.
    plot_images(images=images[0:9],
                cls_true=cls_true[0:9],
                cls_pred=cls_pred[0:9])
```

Function for plotting the `weights` of the model. 10 images are plotted, one for each digit that the model is trained to recognize.

```python
def plot_weights():
    # Get the values for the weights from the TensorFlow variable.
    w = session.run(weights)
    # Get the lowest and highest values for the weights.
    # This is used to correct the colour intensity across
    # the images so they can be compared with each other.
    w_min = np.min(w)
    w_max = np.max(w)
    # Create figure with 3x4 sub-plots,
    # where the last 2 sub-plots are unused.
    fig, axes = plt.subplots(3, 4)
    fig.subplots_adjust(hspace=0.3, wspace=0.3)
    for i, ax in enumerate(axes.flat):
        # Only use the weights for the first 10 sub-plots.
        if i < 10:
            # Get the weights for the i'th digit and reshape it.
            # Note that w.shape == (img_size_flat, 10)
            image = w[:, i].reshape(img_shape)
            # Set the label for the sub-plot.
            ax.set_xlabel("Weights: {0}".format(i))
            # Plot the image.
            ax.imshow(image, vmin=w_min, vmax=w_max, cmap='seismic')
        # Remove ticks from each sub-plot.
        ax.set_xticks([])
        ax.set_yticks([])
```

The accuracy on the test-set is 9.8%. This is because the model has only been initialized and not optimized at all, so it always predicts that the image shows a zero digit, as demonstrated in the plot below, and it turns out that 9.8% of the images in the test-set happen to be zero digits.
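The reasoning behind this number can be sketched in plain Python on a few made-up labels (all values are assumptions). With all weights and biases at zero every logit is zero, so every class ties. In this sketch Python's `max()` resolves the tie by returning the first index, which mirrors the behaviour described above; the accuracy then equals the fraction of zero digits among the labels.

```python
num_classes = 10

# Made-up true classes; two of the ten are the digit 0.
cls_true = [7, 2, 1, 0, 4, 1, 4, 9, 5, 0]

# With weights and biases all zero, every logit is 0 for every image.
logits = [[0.0] * num_classes for _ in cls_true]

# max() returns the first maximal index, so every prediction is class 0.
cls_pred = [max(range(num_classes), key=row.__getitem__) for row in logits]

# Accuracy is the fraction of labels that happen to be 0.
accuracy = sum(p == t for p, t in zip(cls_pred, cls_true)) / len(cls_true)
print(cls_pred[:3], accuracy)  # [0, 0, 0] 0.2
```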

```python
print_accuracy()
```

```python
plot_example_errors()
```

```python
optimize(num_iterations=1)
```

```python
print_accuracy()
```

```python
plot_example_errors()
```

The weights can also be plotted as shown below. Positive weights are red and negative weights are blue. These weights can be intuitively understood as image-filters.

For example, the weights used to determine if an image shows a zero-digit have a positive reaction (red) to an image of a circle, and have a negative reaction (blue) to images with content in the centre of the circle.

Similarly, the weights used to determine if an image shows a one-digit react positively (red) to a vertical line in the centre of the image, and react negatively (blue) to images with content surrounding that line.

Note that the weights mostly look like the digits they're supposed to recognize. This is because only one optimization iteration has been performed, so the weights are only trained on a single batch of 200 images. After training on several thousand images, the weights become more difficult to interpret because they have to recognize many variations of how digits can be written.

```python
plot_weights()
```

```python
# We have already performed 1 iteration.
optimize(num_iterations=9)
```

```python
print_accuracy()
```

```python
plot_example_errors()
```

```python
plot_weights()
```

After 1000 optimization iterations, the model only mis-classifies about one in ten images. As demonstrated below, some of the mis-classifications are justified because the images are very hard to determine with certainty even for humans, while others are quite obvious and should have been classified correctly by a good model. But this simple model cannot reach much better performance and more complex models are therefore needed.

```python
# We have already performed 10 iterations.
optimize(num_iterations=990)
```

```python
print_accuracy()
```

```python
plot_example_errors()
```

```python
plot_weights()
```

```python
print_confusion_matrix()
```

We are now done using TensorFlow, so we close the session to release its resources.

```python
# This has been commented out in case you want to modify and experiment
# with the Notebook without having to restart it.
# session.close()
```

These are a few suggestions for exercises that may help improve your skills with TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn how to use it properly.

You may want to back up this Notebook before making any changes.

- Change the learning-rate for the optimizer.
- Change the optimizer to e.g. `AdagradOptimizer` or `AdamOptimizer`.
- Change the batch-size to e.g. 1 or 1000.
- How do these changes affect the performance?
- Do you think these changes will have the same effect (if any) on other classification problems and mathematical models?
- Do you get the exact same results if you run the Notebook multiple times without changing any parameters? Why or why not?
- Change the function `plot_example_errors()` so it also prints the `logits` and `y_pred` values for the mis-classified examples.
- Use `sparse_softmax_cross_entropy_with_logits` instead of `softmax_cross_entropy_with_logits`. This may require several changes to multiple places in the source-code. Discuss the advantages and disadvantages of using the two methods.
- Remake the program yourself without looking too much at this source-code.
- Explain to a friend how the program works.

Copyright (c) 2016 by Magnus Erik Hvass Pedersen

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.