In [1]:
# setup-only-ignore
import tensorflow as tf
import numpy as np

In [2]:
# setup-only-ignore
sess = tf.InteractiveSession()

Images and TensorFlow

TensorFlow is designed to support working with images as input to neural networks. TensorFlow supports loading common file formats (JPG, PNG), working in different color spaces (RGB, RGBA) and common image manipulation tasks. TensorFlow makes it easier to work with images but it's still a challenge. The largest challenge working with images are the size of the tensor which is eventually loaded. Every image requires a tensor the same size as the image's \\(height \* width \* channels\\). As a reminder, channels are represented as a rank 1 tensor including a scalar amount of color in each channel.

A red RGB pixel in TensorFlow would be represented with the following tensor.

In [3]:
red = tf.constant([255, 0, 0])

Each scalar can be changed to make the pixel another color or a mix of colors. The rank 1 tensor of a pixel is in the format of [red, green, blue] for an RGB color space. All the pixels in an image are stored in files on a disk which need to be read into memory so TensorFlow may operate on them.

Loading images

TensorFlow is designed to make it easy to load files from disk quickly. Loading images is the same as loading any other large binary file until the contents are decoded. Loading this example 3x3 pixel RGB JPG image is done using a similar process to loading any other type of file.

In [4]:
# The match_filenames_once will accept a regex but there is no need for this example.
image_filename = "./images/chapter-05-object-recognition-and-classification/working-with-images/test-input-image.jpg"
filename_queue = tf.train.string_input_producer(

image_reader = tf.WholeFileReader()
_, image_file =
image = tf.image.decode_jpeg(image_file)

The image, which is assumed to be located in a relative directory from where this code is ran. An input producer (tf.train.string_input_producer) finds the files and adds them to a queue for loading. Loading an image requires loading the entire file into memory (tf.WholeFileReader) and onces a file has been read ( the resulting image is decoded (tf.image.decode_jpeg).

Now the image can be inspected, since there is only one file by that name the queue will always return the same image.

In [5]:
# setup-only-ignore
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

In [6]:

array([[[  0,   0,   0],
        [255, 255, 255],
        [254,   0,   0]],

       [[  0, 191,   0],
        [  3, 108, 233],
        [  0, 191,   0]],

       [[254,   0,   0],
        [255, 255, 255],
        [  0,   0,   0]]], dtype=uint8)

In [7]:
# setup-only-ignore

Inspect the output from loading an image, notice that it's a fairly simple rank 3 tensor. The RGB values are found in 9 rank 1 tensors. The higher rank of the image should be familiar from earlier sections. The format of the image loaded in memory is now [batch_size, image_height, image_width, channels].

The batch_size in this example is 1 because there are no batching operations happening. Batching of input is covered in the TensorFlow documentation with a great amount of detail. When dealing with images, note the amount of memory required to load the raw images. If the images are too large or too many are loaded in a batch, the system may stop responding.

Image Formats

It's important to consider aspects of images and how they affect a model. Consider what would happen if a network is trained with input from a single frame of a RED Weapon Camera, which at the time of writing this, has an effective pixel count of 6144x3160. That'd be 19,415,040 rank one tensors with 3 dimensions of color information.

Practically speaking, an input of that size will use a huge amount of system memory. Training a CNN takes a large amount of time and loading very large files slow it down more. Even if the increase in time is acceptable, the size a single image would be hard to fit in memory on the majority of system's GPUs.

A large input image is counterproductive to training most CNNs as well. The CNN is attempting to find inherent attributes in an image, which are unique but generalized so that they may be applied to other images with similar results. Using a large input is flooding a network with irrelevant information which will keep from generalizing the model.

In the Stanford Dogs Dataset there are two extremely different images of the same dog breed which should both match as a Pug. Although cute, these images are filled with useless information which mislead a network during training. For example, the hat worn by the Pug in n02110958_4030.jpg isn't a feature a CNN needs to learn in order to match a Pug. Most Pugs prefer pirate hats so the jester hat is training the network to match a hat which most Pugs don't wear.

Highlighting important information in images is done by storing them in an appropriate file format and manipulating them. Different formats can be used to solve different problems encountered while working with images.


TensorFlow has two image formats used to decode image data, one is tf.image.decode_jpeg and the other is tf.image.decode_png. These are common file formats in computer vision applications because they're trivial to convert other formats to.

Something important to keep in mind, JPEG images don't store any alpha channel information and PNG images do. This could be important if what you're training on requires alpha information (transparency). An example usage scenario is one where you've manually cut out some pieces of an image, for example, irrelevant jester hats on dogs. Setting those pieces to black would make them seem of similar importance to other black colored items in the image. Setting the removed hat to have an alpha of 0 would help in distinguishing its removal.

When working with JPEG images, don't manipulate them too much because it'll leave artifacts. Instead, plan to take raw images and export them to JPEG while doing any manipulation needed. Try to manipulate images before loading them whenever possible to save time in training.

PNG images work well if manipulation is required. PNG format is lossless so it'll keep all the information from the original file (unless they've been resized or downsampled). The downside to PNGs is that the files are larger than their JPEG counterpart.


TensorFlow has a built-in file format designed to keep binary data and label (category for training) data in the same file. The format is called TFRecord and the format requires a preprocessing step to convert images to a TFRecord format before training. The largest benefit is keeping each input image in the same file as the label associated with it.

Technically, TFRecord files are protobuf formatted files. They are great for use as a preprocessed format because they aren't compressed and can be loaded into memory quickly. In this example, an image is written to a new TFRecord formatted file and it's label is stored as well.

In [8]:
# Reuse the image from earlier and give it a fake label
image_label = b'\x01'  # Assume the label data is in a one-hot representation (00000001)

# Convert the tensor into bytes, notice that this will load the entire image file
image_loaded =
image_bytes = image_loaded.tobytes()
image_height, image_width, image_channels = image_loaded.shape

# Export TFRecord
writer = tf.python_io.TFRecordWriter("./output/training-image.tfrecord")

# Don't store the width, height or image channels in this Example file to save space but not required.
example = tf.train.Example(features=tf.train.Features(feature={
            'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_label])),
            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))

# This will save the example to a text file tfrecord

The label is in a format known as one-hot encoding which is a common way to work with label data for categorization of multi-class data. The Stanford Dogs Dataset is being treated as multi-class data because the dogs are being categorized as a single breed and not a mix of breeds. In the real world, a multilabel solution would work well to predict dog breeds because it'd be capable of matching a dog with multiple breeds.

In the example code, the image is loaded into memory and converted into an array of bytes. The bytes are then added to the tf.train.Example file which are serialized SerializeToString before storing to disk. Serialization is a way of converting the in memory object into a format safe to be transferred to a file. The serialized example is now saved in a format which can be loaded and deserialized back to the example format saved here.

Now that the image is saved as a TFRecord it can be loaded again but this time from the TFRecord file. This would be the loading required in a training step to load the image and label for training. This will save time from loading the input image and its corresponding label separately.

In [9]:
# Load TFRecord
tf_record_filename_queue = tf.train.string_input_producer(

# Notice the different record reader, this one is designed to work with TFRecord files which may
# have more than one example in them.
tf_record_reader = tf.TFRecordReader()
_, tf_record_serialized =

# The label and image are stored as bytes but could be stored as int64 or float64 values in a
# serialized tf.Example protobuf.
tf_record_features = tf.parse_single_example(
        'label': tf.FixedLenFeature([], tf.string),
        'image': tf.FixedLenFeature([], tf.string),

# Using tf.uint8 because all of the channel information is between 0-255
tf_record_image = tf.decode_raw(
    tf_record_features['image'], tf.uint8)

# Reshape the image to look like the image saved, not required
tf_record_image = tf.reshape(
    [image_height, image_width, image_channels])
# Use real values for the height, width and channels of the image because it's required
# to reshape the input.

tf_record_label = tf.cast(tf_record_features['label'], tf.string)

At first, the file is loaded in the same way as any other file. The main difference is that the file is then read using a TFRecordReader. Instead of decoding the image, the TFRecord is parsed tf.parse_single_example and then the image is read as raw bytes (tf.decode_raw).

After the file is loaded, it is reshaped (tf.reshape) in order to keep it in the same layout as tf.nn.conv2d expects it [image_height, image_width, image_channels]. It'd be save to expand the dimensions (tf.expand) in order to add in the batch_size dimension to the input_batch.

In this case a single image is in the TFRecord but these record files support multiple examples being written to them. It'd be safe to have a single TFRecord file which stores an entire training set but splitting up the files doesn't hurt.

The following code is useful to check that the image saved to disk is the same as the image which was loaded from TensorFlow.

In [10]:
# setup-only-ignore
sess = tf.InteractiveSession()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

In [11]:, tf_record_image))

array([[[ True,  True,  True],
        [ True,  True,  True],
        [ True,  True,  True]],

       [[ True,  True,  True],
        [ True,  True,  True],
        [ True,  True,  True]],

       [[ True,  True,  True],
        [ True,  True,  True],
        [ True,  True,  True]]], dtype=bool)

All of the attributes of the original image and the image loaded from the TFRecord file are the same. To be sure, load the label from the TFRecord file and check that it is the same as the one saved earlier.

In [12]:
# Check that the label is still 0b00000001.


In [13]:
# setup-only-ignore

Creating a file which stores both the raw image data and the expected output label will save complexities during training. It's not required to use TFRecord files but it's highly recommend when working with images. If it doesn't work well for a workflow, it's still recommended to preprocess images and save them before training. Manipulating an image each time it's loaded is not recommended.

Image Manipulation

CNNs work well when they're given a large amount of diverse quality training data. Images capture complex scenes in a way which visually communicates an intended subject. In the Stanford Dog's Dataset, it's important that the images visually highlight the importance of dogs in the picture. A picture with a dog clearly visible in the center is considered more valuable than one with a dog in the background.

Not all datasets have the most valuable images. The following are two images from the Stanford Dogs Dataset which are supposed to highlight dog breeds. The image on the left n02113978_3480.jpg highlights important attributes of a typical Mexican Hairless Dog, while the image on the right n02113978_1030.jpg highlights the look of inebriated party goers scaring a Mexican Hairless Dog. The image on the right n02113978_1030.jpg is filled with irrelevant information which may train a CNN to categorize party goer faces instead of Mexican Hairless Dog breeds. Images like this may still include an image of a dog and could be manipulated to highlight the dog instead of people.

Image manipulation is best done as a preprocessing step in most scenarios. An image can be cropped, resized and the color levels adjusted. On the other hand, there is an important use case for manipulating an image while training. After an image is loaded, it can be flipped or distorted to diversify the input training information used with the network. This step adds further processing time but helps with overfitting.

TensorFlow is not designed as an image manipulation framework. There are libraries available in Python which support more image manipulation than TensorFlow (PIL and OpenCV). For TensorFlow, we'll summarize a few useful image manipulation features available which are useful in training CNNs.


Cropping an image will remove certain regions of the image without keeping any information. Cropping is similar to tf.slice where a section of a tensor is cut out from the full tensor. Cropping an input image for a CNN can be useful if there is extra input along a dimension which isn't required. For example, cropping dog pictures where the dog is in the center of the images to reduce the size of the input.

In [13]:, 0.1))

array([[[  3, 108, 233]]], dtype=uint8)

The example code uses tf.image.central_crop to crop out 10% of the image and return it. This method always returns based on the center of the image being used.

Cropping is usually done in preprocessing but it can be useful when training if the background is useful. When the background is useful then cropping can be done while randomizing the center offset of where the crop begins.

In [14]:
# This crop method only works on real value input.
real_image =

bounding_crop = tf.image.crop_to_bounding_box(
    real_image, offset_height=0, offset_width=0, target_height=2, target_width=1)

array([[[  0,   0,   0]],

       [[  0, 191,   0]]], dtype=uint8)

The example code uses tf.image.crop_to_bounding_box in order to crop the image starting at the upper left pixel located at (0, 0). Currently, the function only works with a tensor which has a defined shape so an input image needs to be executed on the graph first.


Pad an image with zeros in order to make it the same size as an expected image. This can be accomplished using tf.pad but TensorFlow has another function useful for resizing images which are too large or too small. The method will pad an image which is too small including zeros along the edges of the image. Often, this method is used to resize small images because any other method of resizing with distort the image.

In [15]:
# This padding method only works on real value input.
real_image =

pad = tf.image.pad_to_bounding_box(
    real_image, offset_height=0, offset_width=0, target_height=4, target_width=4)

array([[[  0,   0,   0],
        [255, 255, 255],
        [254,   0,   0],
        [  0,   0,   0]],

       [[  0, 191,   0],
        [  3, 108, 233],
        [  0, 191,   0],
        [  0,   0,   0]],

       [[254,   0,   0],
        [255, 255, 255],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]]], dtype=uint8)

This example code increases the images height by one pixel and its width by a pixel as well. The new pixels are all set to 0. Padding in this manner is useful for scaling up an image which is too small. This can happen if there are images in the training set with a mix of aspect ratios. TensorFlow has a useful shortcut for resizing images which don't match the same aspect ratio using a combination of pad and crop.

In [16]:
# This padding method only works on real value input.
real_image =

crop_or_pad = tf.image.resize_image_with_crop_or_pad(
    real_image, target_height=2, target_width=5)

array([[[  0,   0,   0],
        [  0,   0,   0],
        [255, 255, 255],
        [254,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0, 191,   0],
        [  3, 108, 233],
        [  0, 191,   0],
        [  0,   0,   0]]], dtype=uint8)

The real_image has been reduced in height to be 2 pixels tall and the width has been increased by padding the image with zeros. This function works based on the center of the image input.


Flipping an image is exactly what it sounds like. Each pixel's location is reversed horizontally or vertically. Technically speaking, flopping is the term used when flipping an image vertically. Terms aside, flipping images is useful with TensorFlow to give different perspectives of the same image for training. For example, a picture of an Australian Shepherd with crooked left ear could be flipped in order to allow matching of crooked right ears.

TensorFlow has functions to flip images vertically, horizontally and choose randomly. The ability to randomly flip an image is a useful method to keep from overfitting a model to flipped versions of images.

In [53]:
top_left_pixels = tf.slice(image, [0, 0, 0], [2, 2, 3])

flip_horizon = tf.image.flip_left_right(top_left_pixels)
flip_vertical = tf.image.flip_up_down(flip_horizon)[top_left_pixels, flip_vertical])

[array([[[  0,   0,   0],
         [255, 255, 255]],
        [[  0, 191,   0],
         [  3, 108, 233]]], dtype=uint8), array([[[  3, 108, 233],
         [  0, 191,   0]],
        [[255, 255, 255],
         [  0,   0,   0]]], dtype=uint8)]

This example code flips a subset of the image horizontally and then vertically. The subset is used with tf.slice because the original image flipped returns the same images (for this example only). The subset of pixels illustrates the change which occurs when an image is flipped. tf.image.flip_left_right and tf.image.flip_up_down both operate on tensors which are not limited to images. These will flip an image a single time, randomly flipping an image is done using a separate set of functions.

In [56]:
top_left_pixels = tf.slice(image, [0, 0, 0], [2, 2, 3])

random_flip_horizon = tf.image.random_flip_left_right(top_left_pixels)
random_flip_vertical = tf.image.random_flip_up_down(random_flip_horizon)

array([[[  3, 108, 233],
        [  0, 191,   0]],

       [[255, 255, 255],
        [  0,   0,   0]]], dtype=uint8)

This example does the same logic as the example before except that the output is random. Every time this runs, a different output is expected. There is a parameter named seed which may be used to control how random the flipping occurs.

Saturation and Balance

Images which are found on the internet are often edited in advance. For instance, many of the images found in the Stanford Dogs dataset have too much saturation (lots of color). When an edited image is used for training, it may mislead a CNN model into finding patterns which are related to the edited image and not the content in the image.

TensorFlow has useful functions which help in training on images by changing the saturation, hue, contrast and brightness. The functions allow for simple manipulation of these image attributes as well as randomly altering these attributes. The random altering is useful in training in for the same reason randomly flipping an image is useful. The random attribute changes help a CNN be able to accurately match a feature in images which have been edited or were taken under different lighting.

In [122]:
example_red_pixel = tf.constant([254., 2., 15.])
adjust_brightness = tf.image.adjust_brightness(example_red_pixel, 0.2)

array([ 254.19999695,    2.20000005,   15.19999981], dtype=float32)

This example brightens a single pixel, which is primarily red, with a delta of 0.2. Unfortunately, in the current version of TensorFlow 0.8, this method doesn't work well with a tf.uint8 input. It's best to avoid using this when possible and preprocess brightness changes.

In [126]:
adjust_contrast = tf.image.adjust_contrast(image, -.5), [1, 0, 0], [1, 3, 3]))

array([[[170,  71, 124],
        [168, 112,   7],
        [170,  71, 124]]], dtype=uint8)

The example code changes the contrast by -0.5 which makes the new version of the image fairly unrecognizable. Adjusting contrast is best done in small increments to keep from blowing out an image. Blowing out an image means the same thing as saturating a neuron, it reached its maximum value and can't be recovered. With contrast changes, an image can become completely white and completely black from the same adjustment.

The tf.slice operation is for brevity, highlighting one of the pixels which has changed. It is not required when running this operation.

In [76]:
adjust_hue = tf.image.adjust_hue(image, 0.7), [1, 0, 0], [1, 3, 3]))

array([[[191,  38,   0],
        [ 62, 233,   3],
        [191,  38,   0]]], dtype=uint8)

The example code adjusts the hue found in the image to make it more colorful. The adjustment accepts a delta parameter which controls the amount of hue to adjust in the image.

In [80]:
adjust_saturation = tf.image.adjust_saturation(image, 0.4), [1, 0, 0], [1, 3, 3]))

array([[[114, 191, 114],
        [141, 183, 233],
        [114, 191, 114]]], dtype=uint8)

The code is similar to adjusting the contrast. It is common to oversaturate an image in order to identify edges because the increased saturation highlights changes in colors.


CNNs are commonly trained using images with a single color. When an image has a single color it is said to use a grayscale colorspace meaning it uses a single channel of colors. For most computer vision related tasks, using grayscale is reasonable because the shape of an image can be seen without all the colors. The reduction in colors equates to a quicker to train image. Instead of a 3 component rank 1 tensor to describe each color found with RGB, a grayscale image requires a single component rank 1 tensor to describe the amount of gray found in the image.

Although grayscale has benefits, it's important to consider applications which require a distinction based on color. Color in images is challenging to work with in most computer vision because it isn't easy to mathematically define the similarity of two RGB colors. In order to use colors in CNN training, it's useful to convert the colorspace the image is natively in.


Grayscale has a single component to it and has the same range of color as RGB \\([0, 255]\\).

In [88]:
gray = tf.image.rgb_to_grayscale(image), [0, 0, 0], [1, 3, 1]))

array([[[  0],
        [ 76]]], dtype=uint8)

This example converted the RGB image into grayscale. The tf.slice operation took the top row of pixels out to investigate how their color has changed. The grayscale conversion is done by averaging all the color values for a pixel and setting the amount of grayscale to be the average.


Hue, saturation and value are what make up HSV colorspace. This space is represented with a 3 component rank 1 tensor similar to RGB. HSV is not similar to RGB in what it measures, it's measuring attributes of an image which are closer to human perception of color than RGB. It is sometimes called HSB, where the B stands for brightness.

In [94]:
hsv = tf.image.rgb_to_hsv(tf.image.convert_image_dtype(image, tf.float32)), [0, 0, 0], [3, 3, 3]))

array([[[ 0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.        ,  1.        ,  0.99607849]],

       [[ 0.33333334,  1.        ,  0.74901962],
        [ 0.59057975,  0.98712444,  0.91372555],
        [ 0.33333334,  1.        ,  0.74901962]],

       [[ 0.        ,  1.        ,  0.99607849],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.        ,  0.        ,  0.        ]]], dtype=float32)


RGB is the colorspace which has been used in all the example code so far. It's broken up into a 3 component rank 1 tensor which includes the amount of red \\([0, 255]\\), green \\([0, 255]\\) and blue \\([0, 255]\\). Most images are already in RGB but TensorFlow has builtin functions in case the images are in another colorspace.

In [103]:
rgb_hsv = tf.image.hsv_to_rgb(hsv)
rgb_grayscale = tf.image.grayscale_to_rgb(gray)

The example code is straightforward except that the conversion from grayscale to RGB doesn't make much sense. RGB expects three colors while grayscale only has one. When the conversion occurs, the RGB values are filled with the same value which is found in the grayscale pixel.


Lab is not a colorspace which TensorFlow has native support for. It's a useful colorspace because it can map to a larger number of perceivable colors than RGB. Although TensorFlow doesn't support this natively, it is a colorspace which is often used in professional settings. Another Python library python-colormath has support for Lab conversion as well as other colorspaces not described here.

The largest benefit using a Lab colorspace is it maps closer to humans perception of the difference in colors than RGB or HSV. The euclidean distance between two colors in a Lab colorspace are somewhat representative of how different the colors look to a human.

Casting Images

In these examples, tf.to_float is often used in order to illustrate changing an image's type to another format. For examples, this works OK but TensorFlow has a built in function to properly scale values as they change types. tf.image.convert_image_dtype(image, dtype, saturate=False) is a useful shortcut to change the type of an image from tf.uint8 to tf.float.