The goal of this notebook is to get some hands-on experience with pre-trained Keras models that are reasonably close to the state of the art on some computer vision tasks. The models are pre-trained on large publicly available labeled image datasets such as ImageNet and COCO.
This notebook highlights two specific tasks:
Image classification: predict only one class label per image (assuming a single centered object or image class).
Object detection and instance segmentation: detect and localise all occurrences of objects from a predefined list of classes of interest in a given image.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Let's use scikit-image to load the content of a JPEG file into a numpy array:
In [ ]:
from skimage.io import imread
image = imread('laptop.jpeg')
type(image)
The dimensions of the array are:
In [ ]:
image.shape
For efficiency reasons, the pixel intensities of each channel are stored as 8-bit unsigned integers taking values in the [0-255] range:
In [ ]:
image.dtype
In [ ]:
image.min(), image.max()
In [ ]:
plt.imshow(image);
In [ ]:
image.shape
In [ ]:
np.prod(image.shape)
In [ ]:
image.dtype
In [ ]:
# height * width * channels * (bits per value / bits per byte)
450 * 800 * 3 * (8 / 8)
Let's check by asking numpy:
In [ ]:
image.nbytes
In [ ]:
print("image size: {:0.3} MB".format(image.nbytes / 1e6))
Indexing on the last dimension makes it possible to extract the 2D content of a specific color channel, for instance the red channel:
In [ ]:
red_channel = image[:, :, 0]
red_channel
In [ ]:
red_channel.min(), red_channel.max()
In [ ]:
plt.imshow(image[:, :, 0], cmap=plt.cm.Reds_r);
Compute a grey-level version of the image with shape (height, width) by averaging the values across color channels using image.mean.
Plot the result with plt.imshow using a grey levels colormap.
Can the uint8 integer data type represent those average values? Check the data type used by numpy.
What is the size in (mega)bytes of this image?
What is the expected range of values for the new pixels?
In [ ]:
In [ ]:
# %load solutions/grey_levels.py
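Here is a minimal sketch of one possible solution (the official solution can be loaded from solutions/grey_levels.py):
In [ ]:
# Sketch of a possible solution: average across the color channel axis.
grey_image = image.mean(axis=2)
print(grey_image.shape)  # (height, width)
# uint8 cannot represent fractional averages: numpy promotes to float64.
print(grey_image.dtype)
print("image size: {:0.3} MB".format(grey_image.nbytes / 1e6))
# Averages of values in [0, 255] stay in the [0.0, 255.0] range.
plt.imshow(grey_image, cmap=plt.cm.Greys_r);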
When dealing with a heterogeneous collection of images of various sizes, it is often necessary to resize them to a common size. More specifically:
for image classification, most networks expect a specific fixed input size;
for object detection and instance segmentation, networks have more flexibility, but the images should have approximately the same size as the training set images.
Furthermore, large images can be much slower to process than smaller ones (the number of pixels grows quadratically with the image's linear dimensions).
In [ ]:
from skimage.transform import resize
image = imread('laptop.jpeg')
lowres_image = resize(image, (50, 50), mode='reflect', anti_aliasing=True)
lowres_image.shape
In [ ]:
plt.imshow(lowres_image, interpolation='nearest');
The pixel values of the low resolution image are computed by combining the values of the pixels in the high resolution image. The result is therefore represented as floating-point numbers.
In [ ]:
lowres_image.dtype
In [ ]:
print("image size: {:0.3} MB".format(lowres_image.nbytes / 1e6))
By convention, both skimage.transform.resize and plt.imshow assume that floating point values range from 0.0 to 1.0, as opposed to 0 to 255 for 8-bit integers:
In [ ]:
lowres_image.min(), lowres_image.max()
Note that Keras, on the other hand, might expect images encoded with values in the [0.0 - 255.0] range irrespective of the dtype of the array. To avoid the implicit conversion to the [0.0 - 1.0] range we use the preserve_range=True option.
In [ ]:
lowres_large_range_image = resize(image, (50, 50), mode='reflect',
anti_aliasing=True, preserve_range=True)
In [ ]:
lowres_large_range_image.shape
In [ ]:
lowres_large_range_image.dtype
In [ ]:
lowres_large_range_image.min(), lowres_large_range_image.max()
Warning: the behavior of plt.imshow depends on both the dtype and the dynamic range when displaying RGB images. In particular it does not work on RGB images with float64 values in the [0.0 - 255.0] range:
In [ ]:
plt.imshow(lowres_large_range_image, interpolation='nearest');
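One way to make the image displayable again (a sketch, assuming we only need it for visualization) is to cast it back to 8-bit integers:
In [ ]:
# Casting back to uint8 restores a dtype / value range combination
# that plt.imshow can render directly.
plt.imshow(lowres_large_range_image.astype(np.uint8), interpolation='nearest');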
Let's use the Python API of OpenCV to take pictures.
In [ ]:
import cv2

def camera_grab(camera_id=0, fallback_filename=None):
    camera = cv2.VideoCapture(camera_id)
    try:
        # Take 10 consecutive snapshots to let the camera automatically tune
        # itself and hope that the contrast and lighting of the last snapshot
        # are good enough.
        for i in range(10):
            snapshot_ok, image = camera.read()
        if snapshot_ok:
            # OpenCV uses BGR channel ordering by default: convert to RGB.
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        else:
            print("WARNING: could not access camera")
            if fallback_filename:
                image = imread(fallback_filename)
    finally:
        camera.release()
    return image
In [ ]:
image = camera_grab(camera_id=0, fallback_filename='laptop.jpeg')
plt.imshow(image)
print("dtype: {}, shape: {}, range: {}".format(
image.dtype, image.shape, (image.min(), image.max())))
In [ ]:
from tensorflow.keras.applications.resnet50 import preprocess_input, ResNet50
model = ResNet50(weights='imagenet')
Let's check that the TensorFlow backend used by Keras expects the color channels on the last axis. If that had not been the case, it would have been possible to change the order of the axes with images = images.transpose(2, 0, 1).
In [ ]:
import tensorflow.keras.backend as K
K.image_data_format()
The network has been trained on (224, 224) RGB images.
In [ ]:
model.input_shape
None is used by Keras to mark dimensions with a dynamic number of elements. In this case None is the "batch size", that is, the number of images that can be processed at once. In the following we will process only one image at a time.
In [ ]:
image = imread('laptop.jpeg')
image_224 = resize(image, (224, 224), preserve_range=True, mode='reflect')
In [ ]:
image_224.shape
In [ ]:
image_224.dtype
In [ ]:
image_224 = image_224.astype(np.float32)
image_224.dtype
In [ ]:
plt.imshow(image_224 / 255);
Note that the image has been deformed by the resizing. In practice this should not degrade the performance of the network too much.
An alternative would be to resize the image so that its shortest side matches the target size (preserving the aspect ratio) and then crop the central square, as sketched below.
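Here is a minimal sketch of that resize-then-crop strategy (the variable names are illustrative, and we assume the source image is larger than 224 pixels on each side):
In [ ]:
# Scale so that the shortest side becomes 224 pixels, preserving the
# aspect ratio, then crop the central (224, 224) square.
height, width = image.shape[:2]
scale = 224 / min(height, width)
new_shape = (int(height * scale + 0.5), int(width * scale + 0.5))
resized = resize(image, new_shape, mode='reflect', preserve_range=True)
top = (resized.shape[0] - 224) // 2
left = (resized.shape[1] - 224) // 2
image_224_cropped = resized[top:top + 224, left:left + 224]
plt.imshow(image_224_cropped / 255);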
In [ ]:
model.input_shape
In [ ]:
image_224.shape
In [ ]:
image_224_batch = np.expand_dims(image_224, axis=0)
# Or image_224_batch = image_224[None, ...] if you are familiar with broadcasting in numpy
image_224_batch.shape
image_224_batch is now compatible with the input shape of the neural network, so let's make a prediction.
In [ ]:
%%time
x = preprocess_input(image_224_batch.copy())
preds = model.predict(x)
The output predictions are a 2D array:
In [ ]:
type(preds)
In [ ]:
preds.dtype
In [ ]:
preds.shape
In [ ]:
# The softmax output over the 1000 ImageNet classes sums to (approximately) 1:
preds.sum(axis=1)
In [ ]:
from tensorflow.keras.applications.resnet50 import decode_predictions
decode_predictions(preds, top=5)
In [ ]:
print('Predicted image labels:')
class_names, confidences = [], []
for class_id, class_name, confidence in decode_predictions(preds, top=5)[0]:
    class_names.append(class_name)
    confidences.append(confidence)
    print("    {} (synset: {}): {:0.3f}".format(class_name, class_id, confidence))
Check on ImageNet to better understand the use of the term "notebook" in the training set: http://image-net.org/search?q=notebook.
Note that the network is not too confident about the class of the main object in that image. If we were to merge the "notebook" and "laptop" classes, this prediction would be good.
Furthermore, the network also considers secondary objects ("desk", "mouse"...), but the model has been trained as an image (multiclass) classification model with a single expected class per image, rather than as a multi-label classification model such as an object detection model with several positive labels per image.
We have to keep that in mind when trying to make use of the predictions of such a model for a practical application. This is a fundamental limitation of the label structure of the training set.
All Keras pretrained vision models expect images with float32 dtype and values in the [0, 255] range. When training neural networks it often works better to have values closer to zero.
A typical preprocessing is to center each channel and normalize its variance.
Another is to measure the min and max values and to shift and rescale the values to the (-1.0, 1.0) range.
The exact kind of preprocessing is not very important, but it's very important to always reuse the preprocessing function that was used when training the model.
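For illustration, here is a sketch of both schemes on our image (generic examples only; they are not the exact transformation implemented by ResNet 50's preprocess_input):
In [ ]:
x = image_224.astype(np.float32)

# Scheme 1: center each channel and normalize its variance (in practice
# the mean and std would be computed once over the whole training set).
standardized = (x - x.mean(axis=(0, 1))) / x.std(axis=(0, 1))

# Scheme 2: shift and rescale the values to the (-1.0, 1.0) range.
rescaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1

standardized.mean(axis=(0, 1)), (rescaled.min(), rescaled.max())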
In [ ]:
image = imread('laptop.jpeg')
image_224 = resize(image, (224, 224), preserve_range=True, mode='reflect')
image_224_batch = np.expand_dims(image_224, axis=0)
image_224_batch.min(), image_224_batch.max()
In [ ]:
preprocessed_batch = preprocess_input(image_224_batch.copy())
In [ ]:
preprocessed_batch.min(), preprocessed_batch.max()
Note that we make a copy each time, as preprocess_input can modify the image in place to reuse memory when preprocessing large datasets.
Write a function named classify that takes a snapshot of the webcam and displays it along with the decoded predictions of the model and their confidence levels.
If you don't have access to a webcam, take a picture with your mobile phone or a photo of your choice from the web, store it as a JPEG file on disk instead, and pass that file to the neural network to make the prediction.
Try to classify a photo of your face. Look at the confidence level. Can you explain the results?
Try to classify photos of common objects such as a book, a mobile phone, a cup... Try to center the objects and remove clutter to get confidence higher than 0.5.
In [ ]:
def classify():
# TODO: write me
pass
classify()
In [ ]:
# %load solutions/classify_webcam.py
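For reference, here is a possible sketch of such a function, reusing the helpers defined earlier in this notebook (the official solution can be loaded from solutions/classify_webcam.py):
In [ ]:
def classify():
    # Grab a frame from the webcam, or fall back to the stored JPEG file.
    frame = camera_grab(camera_id=0, fallback_filename='laptop.jpeg')
    plt.imshow(frame)

    # Resize to the fixed (224, 224) input shape expected by ResNet 50
    # and apply the matching preprocessing function.
    frame_224 = resize(frame, (224, 224), preserve_range=True,
                       mode='reflect').astype(np.float32)
    preds = model.predict(preprocess_input(frame_224[np.newaxis]))
    for class_id, class_name, confidence in decode_predictions(preds, top=5)[0]:
        print("    {} (synset: {}): {:0.3f}".format(
            class_name, class_id, confidence))

classify()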
Use the "MobileNet" and "Inception Resnet v2" models from keras.applications
instead of Resnet 50 to classify images from the webcam or stored as a JPEG file.
Read the documentation for more details on the expected input shape and preprocessing:
https://keras.io/applications/
Measure prediction time using %%time
to compare to Resnet 50.
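For instance, here is a minimal sketch for MobileNet (each model family in keras.applications ships its own preprocess_input; "Inception Resnet v2" additionally expects (299, 299) inputs):
In [ ]:
from tensorflow.keras.applications.mobilenet import (
    MobileNet, preprocess_input as mobilenet_preprocess_input)

mobilenet_model = MobileNet(weights='imagenet')
mobilenet_model.input_shape
In [ ]:
%%time
# Reuse the batch prepared earlier, with MobileNet's own preprocessing.
preds = mobilenet_model.predict(
    mobilenet_preprocess_input(image_224_batch.copy()))
decode_predictions(preds, top=5)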
To time the execution of a notebook cell, you can use the %%time magic command. Here is an example:
In [ ]:
%%time
a = 0
for i in range(10000000):
    a += 1
print('Computation complete!')