The goal of this notebook is to get some hands-on experience with pre-trained Keras models that are reasonably close to the state of the art on some computer vision tasks. The models are pre-trained on large publicly available labeled image datasets such as ImageNet and COCO.
This notebook highlights two specific tasks:
Image classification: predict only one class label per image (assuming a single centered object or image class).
Object detection and instance segmentation: detect and localise all occurrences of objects from a predefined list of classes of interest in a given image.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Let's use scikit-image to load the content of a JPEG file into a numpy array:
In [ ]:
from skimage.io import imread
image = imread('laptop.jpeg')
type(image)
The dimensions of the array are:
In [ ]:
image.shape
For efficiency reasons, the pixel intensities of each channel are stored as 8-bit unsigned integers taking values in the [0-255] range:
In [ ]:
image.dtype
In [ ]:
image.min(), image.max()
In [ ]:
plt.imshow(image);
In [ ]:
image.shape
In [ ]:
np.prod(image.shape)
In [ ]:
image.dtype
In [ ]:
# height * width * channels * (bits per value / bits per byte)
450 * 800 * 3 * (8 / 8)
Let's check by asking numpy:
In [ ]:
image.nbytes
In [ ]:
print("image size: {:0.3} MB".format(image.nbytes / 1e6))
Indexing on the last dimension makes it possible to extract the 2D content of a specific color channel, for instance the red channel:
In [ ]:
red_channel = image[:, :, 0]
red_channel
In [ ]:
red_channel.min(), red_channel.max()
In [ ]:
plt.imshow(image[:, :, 0], cmap=plt.cm.Reds_r);
Compute a grey-level version of the image with shape (height, width) by averaging the values across color channels using image.mean.
Plot the result with plt.imshow using a grey levels colormap.
Can the uint8 integer data type represent those average values? Check the data type used by numpy.
What is the size in (mega)bytes of this image?
What is the expected range of values for the new pixels?
In [ ]:
In [ ]:
# %load solutions/grey_levels.py
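Here is a minimal sketch of one possible solution (the official solution can be loaded from solutions/grey_levels.py):
In [ ]:
# Sketch of a possible solution: average across the color channel axis.
grey_image = image.mean(axis=2)
print(grey_image.shape)  # (height, width)
# uint8 cannot represent fractional averages: numpy promotes to float64.
print(grey_image.dtype)
print("image size: {:0.3} MB".format(grey_image.nbytes / 1e6))
# Averages of values in [0, 255] stay in the [0.0, 255.0] range.
plt.imshow(grey_image, cmap=plt.cm.Greys_r);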
When dealing with a heterogeneous collection of images of various sizes, it is often necessary to resize them to a common size. More specifically:
for image classification, most networks expect a specific fixed input size;
for object detection and instance segmentation, networks have more flexibility, but the images should have approximately the same size as the training set images.
Furthermore, large images can be much slower to process than smaller ones (the number of pixels grows quadratically with the image's linear dimensions).
In [ ]:
from skimage.transform import resize
image = imread('laptop.jpeg')
lowres_image = resize(image, (50, 50), mode='reflect', anti_aliasing=True)
lowres_image.shape
In [ ]:
plt.imshow(lowres_image, interpolation='nearest');
The pixel values of the low resolution image are computed by combining the values of the pixels in the high resolution image. The result is therefore represented as floating-point numbers.
In [ ]:
lowres_image.dtype
In [ ]:
print("image size: {:0.3} MB".format(lowres_image.nbytes / 1e6))
By convention, both skimage.transform.resize and plt.imshow assume that floating point values range from 0.0 to 1.0, as opposed to 0 to 255 for 8-bit integers:
In [ ]:
lowres_image.min(), lowres_image.max()
Note that Keras, on the other hand, might expect images encoded with values in the [0.0 - 255.0] range irrespective of the dtype of the array. To avoid the implicit conversion to the [0.0 - 1.0] range we use the preserve_range=True option.
In [ ]:
lowres_large_range_image = resize(image, (50, 50), mode='reflect',
anti_aliasing=True, preserve_range=True)
In [ ]:
lowres_large_range_image.shape
In [ ]:
lowres_large_range_image.dtype
In [ ]:
lowres_large_range_image.min(), lowres_large_range_image.max()
Warning: the behavior of plt.imshow depends on both the dtype and the dynamic range when displaying RGB images. In particular it does not work on RGB images with float64 values in the [0.0 - 255.0] range:
In [ ]:
plt.imshow(lowres_large_range_image, interpolation='nearest');
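One way to make the image displayable again (a sketch, assuming we only need it for visualization) is to cast it back to 8-bit integers:
In [ ]:
# Casting back to uint8 restores a dtype / value range combination
# that plt.imshow can render directly.
plt.imshow(lowres_large_range_image.astype(np.uint8), interpolation='nearest');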
Let's use the Python API of OpenCV to take pictures.
In [ ]:
import cv2

def camera_grab(camera_id=0, fallback_filename=None):
    camera = cv2.VideoCapture(camera_id)
    try:
        # Take 10 consecutive snapshots to let the camera automatically tune
        # itself and hope that the contrast and lighting of the last snapshot
        # are good enough.
        for i in range(10):
            snapshot_ok, image = camera.read()
        if snapshot_ok:
            # OpenCV uses BGR channel ordering by default: convert to RGB.
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        else:
            print("WARNING: could not access camera")
            if fallback_filename:
                image = imread(fallback_filename)
    finally:
        camera.release()
    return image
In [ ]:
image = camera_grab(camera_id=0, fallback_filename='laptop.jpeg')
plt.imshow(image)
print("dtype: {}, shape: {}, range: {}".format(
image.dtype, image.shape, (image.min(), image.max())))
In [ ]:
from tensorflow.keras.applications.resnet50 import preprocess_input, ResNet50
model = ResNet50(weights='imagenet')
Let's check that the TensorFlow backend used by Keras expects the color channels on the last axis. If that had not been the case, it would have been possible to change the order of the axes with images = images.transpose(2, 0, 1).
In [ ]:
import tensorflow.keras.backend as K
K.image_data_format()
The network has been trained on (224, 224) RGB images.
In [ ]:
model.input_shape
None is used by Keras to mark dimensions with a dynamic number of elements. In this case None is the "batch size", that is, the number of images that can be processed at once. In the following we will process only one image at a time.
In [ ]:
image = imread('laptop.jpeg')
image_224 = resize(image, (224, 224), preserve_range=True, mode='reflect')
In [ ]:
image_224.shape
In [ ]:
image_224.dtype
In [ ]:
image_224 = image_224.astype(np.float32)
image_224.dtype
In [ ]:
plt.imshow(image_224 / 255);
Note that the image has been deformed by the resizing. In practice this should not degrade the performance of the network too much.
An alternative would be to resize the image so that its shortest side matches the target size (preserving the aspect ratio) and then crop the central square, as sketched below.
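Here is a minimal sketch of that resize-then-crop strategy (the variable names are illustrative, and we assume the source image is larger than 224 pixels on each side):
In [ ]:
# Scale so that the shortest side becomes 224 pixels, preserving the
# aspect ratio, then crop the central (224, 224) square.
height, width = image.shape[:2]
scale = 224 / min(height, width)
new_shape = (int(height * scale + 0.5), int(width * scale + 0.5))
resized = resize(image, new_shape, mode='reflect', preserve_range=True)
top = (resized.shape[0] - 224) // 2
left = (resized.shape[1] - 224) // 2
image_224_cropped = resized[top:top + 224, left:left + 224]
plt.imshow(image_224_cropped / 255);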
In [ ]:
model.input_shape
In [ ]:
image_224.shape
In [ ]:
image_224_batch = np.expand_dims(image_224, axis=0)
# Or image_224_batch = image_224[None, ...] if you are familiar with broadcasting in numpy
image_224_batch.shape
image_224_batch is now compatible with the input shape of the neural network, so let's make a prediction.
In [ ]:
%%time
x = preprocess_input(image_224_batch.copy())
preds = model.predict(x)
The output predictions are a 2D array:
In [ ]:
type(preds)
In [ ]:
preds.dtype
In [ ]:
preds.shape
In [ ]:
# The softmax output over the 1000 ImageNet classes sums to (approximately) 1:
preds.sum(axis=1)
In [ ]:
from tensorflow.keras.applications.resnet50 import decode_predictions
decode_predictions(preds, top=5)
In [ ]:
print('Predicted image labels:')
class_names, confidences = [], []
for class_id, class_name, confidence in decode_predictions(preds, top=5)[0]:
    class_names.append(class_name)
    confidences.append(confidence)
    print("    {} (synset: {}): {:0.3f}".format(class_name, class_id, confidence))
Check on ImageNet to better understand the use of the term "notebook" in the training set: http://image-net.org/search?q=notebook.
Note that the network is not too confident about the class of the main object in that image. If we were to merge the "notebook" and "laptop" classes, this prediction would be good.
Furthermore, the network also considers secondary objects ("desk", "mouse"...), but the model has been trained as an image (multiclass) classification model with a single expected class per image, rather than as a multi-label classification model such as an object detection model with several positive labels per image.
We have to keep that in mind when trying to make use of the predictions of such a model for a practical application. This is a fundamental limitation of the label structure of the training set.
All Keras pretrained vision models expect images with float32 dtype and values in the [0, 255] range. When training neural networks it often works better to have values closer to zero.
A typical preprocessing is to center each channel and normalize its variance.
Another is to measure the min and max values and to shift and rescale the values to the (-1.0, 1.0) range.
The exact kind of preprocessing is not very important, but it's very important to always reuse the preprocessing function that was used when training the model.
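For illustration, here is a sketch of both schemes on our image (generic examples only; they are not the exact transformation implemented by ResNet 50's preprocess_input):
In [ ]:
x = image_224.astype(np.float32)

# Scheme 1: center each channel and normalize its variance (in practice
# the mean and std would be computed once over the whole training set).
standardized = (x - x.mean(axis=(0, 1))) / x.std(axis=(0, 1))

# Scheme 2: shift and rescale the values to the (-1.0, 1.0) range.
rescaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1

standardized.mean(axis=(0, 1)), (rescaled.min(), rescaled.max())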
In [ ]:
image = imread('laptop.jpeg')
image_224 = resize(image, (224, 224), preserve_range=True, mode='reflect')
image_224_batch = np.expand_dims(image_224, axis=0)
image_224_batch.min(), image_224_batch.max()
In [ ]:
preprocessed_batch = preprocess_input(image_224_batch.copy())
In [ ]:
preprocessed_batch.min(), preprocessed_batch.max()
Note that we make a copy each time, as preprocess_input can modify the image in place to reuse memory when preprocessing large datasets.
Write a function named classify that takes a snapshot of the webcam and displays it along with the decoded predictions of the model and their confidence levels.
If you don't have access to a webcam, take a picture with your mobile phone or a photo of your choice from the web, store it as a JPEG file on disk instead, and pass that file to the neural network to make the prediction.
Try to classify a photo of your face. Look at the confidence level. Can you explain the results?
Try to classify photos of common objects such as a book, a mobile phone, a cup... Try to center the objects and remove clutter to get confidence higher than 0.5.
In [ ]:
def classify():
# TODO: write me
pass
classify()
In [ ]:
# %load solutions/classify_webcam.py
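For reference, here is a possible sketch of such a function, reusing the helpers defined earlier in this notebook (the official solution can be loaded from solutions/classify_webcam.py):
In [ ]:
def classify():
    # Grab a frame from the webcam, or fall back to the stored JPEG file.
    frame = camera_grab(camera_id=0, fallback_filename='laptop.jpeg')
    plt.imshow(frame)

    # Resize to the fixed (224, 224) input shape expected by ResNet 50
    # and apply the matching preprocessing function.
    frame_224 = resize(frame, (224, 224), preserve_range=True,
                       mode='reflect').astype(np.float32)
    preds = model.predict(preprocess_input(frame_224[np.newaxis]))
    for class_id, class_name, confidence in decode_predictions(preds, top=5)[0]:
        print("    {} (synset: {}): {:0.3f}".format(
            class_name, class_id, confidence))

classify()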
Use the "MobileNet" and "Inception Resnet v2" models from keras.applications
instead of Resnet 50 to classify images from the webcam or stored as a JPEG file.
Read the documentation for more details on the expected input shape and preprocessing:
https://keras.io/applications/
Measure prediction time using %%time
to compare to Resnet 50.
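For instance, here is a minimal sketch for MobileNet (each model family in keras.applications ships its own preprocess_input; "Inception Resnet v2" additionally expects (299, 299) inputs):
In [ ]:
from tensorflow.keras.applications.mobilenet import (
    MobileNet, preprocess_input as mobilenet_preprocess_input)

mobilenet_model = MobileNet(weights='imagenet')
mobilenet_model.input_shape
In [ ]:
%%time
# Reuse the batch prepared earlier, with MobileNet's own preprocessing.
preds = mobilenet_model.predict(
    mobilenet_preprocess_input(image_224_batch.copy()))
decode_predictions(preds, top=5)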
To time the execution of a notebook cell, you can use the %%time magic command. Here is an example:
In [ ]:
%%time
a = 0
for i in range(10000000):
    a += 1
print('Computation complete!')