In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
Let's use scikit-image to load the content of a JPEG file into a numpy array:
In [ ]:
from skimage.io import imread
image = imread('laptop.jpeg')
type(image)
The dimensions of the array are:
In [ ]:
image.shape
For efficiency reasons, the pixel intensities of each channel are stored as 8-bit unsigned integers taking values in the [0, 255] range:
In [ ]:
image.dtype
In [ ]:
image.min(), image.max()
In [ ]:
plt.imshow(image);
In [ ]:
np.prod(image.shape)
In [ ]:
450 * 800 * 3 * (8 / 8)  # height * width * channels * (8 bits / 8 bits per byte)
Let's check by asking numpy:
In [ ]:
print("image size: {:0.3} MB".format(image.nbytes / 1e6))
Indexing on the last dimension makes it possible to extract the 2D content of a specific color channel, for instance the red channel:
In [ ]:
red_channel = image[:, :, 0]
red_channel
In [ ]:
plt.imshow(image[:, :, 0], cmap=plt.cm.Reds_r);
Compute a grey-level version of the image with shape (height, width) by averaging the values across color channels using image.mean.
Plot the result with plt.imshow using a grey levels colormap.
Can the uint8 integer data type represent those average values? Check the data type used by numpy.
What is the size in (mega)bytes of this image?
What is the expected range of values for the new pixels?
In [ ]:
In [ ]:
# %load solutions/grey_levels.py
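If you want to compare with your own attempt before loading the solution, here is a minimal sketch of one possible approach (the grey_image variable name is mine, not necessarily what the solution file uses):
In [ ]:
# Sketch: average over the color channel axis to get a (height, width) array.
grey_image = image.mean(axis=2)
plt.imshow(grey_image, cmap=plt.cm.Greys_r);

# Averaging uint8 values yields fractional results, so numpy promotes to float64.
print(grey_image.dtype)

# One float64 value (8 bytes) per pixel instead of 3 uint8 values for the RGB image.
print("image size: {:0.3} MB".format(grey_image.nbytes / 1e6))

# Averages of values in [0, 255] stay in [0, 255].
print(grey_image.min(), grey_image.max())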
When dealing with a heterogeneous collection of images of various sizes, it is often necessary to resize them to a common size. More specifically:
for image classification, most networks expect a specific fixed input size;
for object detection and instance segmentation, networks have more flexibility, but the images should have approximately the same size as the training set images.
Furthermore, large images can be much slower to process than smaller images (the number of pixels grows quadratically with the image side length).
In [ ]:
from skimage.transform import resize
image = imread('laptop.jpeg')
lowres_image = resize(image, (50, 50), mode='reflect', anti_aliasing=True)
lowres_image.shape
In [ ]:
plt.imshow(lowres_image, interpolation='nearest');
The values of the pixels of the low resolution image are computed by combining the values of the pixels in the high resolution image. The result is therefore represented with floating point values.
In [ ]:
lowres_image.dtype
By convention, both skimage.transform.resize and plt.imshow assume that floating point values range from 0.0 to 1.0, as opposed to 0 to 255 when using 8-bit integers:
In [ ]:
lowres_image.min(), lowres_image.max()
Note that Keras, on the other hand, might expect images encoded with values in the [0.0, 255.0] range irrespective of the data type of the array. To avoid the implicit conversion to the [0.0, 1.0] range we use the preserve_range=True option.
In [ ]:
lowres_large_range_image = resize(image, (50, 50), mode='reflect',
anti_aliasing=True, preserve_range=True)
In [ ]:
lowres_large_range_image.shape
In [ ]:
lowres_large_range_image.dtype
In [ ]:
lowres_large_range_image.min(), lowres_large_range_image.max()
Warning: the behavior of plt.imshow depends on both the dtype and the dynamic range when displaying RGB images. In particular, it does not work on RGB images with float64 values in the [0.0, 255.0] range:
In [ ]:
plt.imshow(lowres_large_range_image, interpolation='nearest');
In [ ]:
# %load solutions/question_imshow_dtype_and_range.py
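A minimal sketch of one way to make this image displayable again (not necessarily the loaded solution): either cast back to uint8 or rescale to the [0.0, 1.0] float range.
In [ ]:
# Cast back to 8-bit integers in the [0, 255] range...
plt.imshow(lowres_large_range_image.astype(np.uint8), interpolation='nearest');

# ...or, equivalently, rescale the float values to [0.0, 1.0]:
# plt.imshow(lowres_large_range_image / 255, interpolation='nearest');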
In [ ]:
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
model = ResNet50(include_top=True, weights='imagenet')
In [ ]:
model.summary()
Exercise
Use the pretrained model to classify an image and decode the predictions with decode_predictions from Keras.
Notes:
The image "images_resize/000007.jpg" can be used as an example input.
Use preprocess_input for preprocessing the image.
The network expects an input of size (224, 224) with a dynamic in [0, 255] before preprocessing.
skimage's resize has a preserve_range flag that you might find useful.
In [ ]:
from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.applications.imagenet_utils import decode_predictions
path = "laptop.jpeg"
# TODO
In [ ]:
# %load solutions/predict_image.py
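Here is a minimal sketch of one possible pipeline to compare with your own code; it reuses the model loaded above but is not necessarily the content of the solution file:
In [ ]:
# Sketch: load, resize to (224, 224) keeping the [0, 255] dynamic, preprocess and predict.
img = imread(path)
img_224 = resize(img, (224, 224), mode='reflect', preserve_range=True)
img_batch = preprocess_input(img_224[np.newaxis].astype(np.float32))
predictions = model.predict(img_batch)
print(decode_predictions(predictions, top=5))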
Let's use the Python API of OpenCV to take pictures.
In [ ]:
import cv2

def camera_grab(camera_id=0, fallback_filename=None):
    camera = cv2.VideoCapture(camera_id)
    try:
        # take 10 consecutive snapshots to let the camera automatically tune
        # itself and hope that the contrast and lighting of the last snapshot
        # is good enough.
        for i in range(10):
            snapshot_ok, image = camera.read()
        if snapshot_ok:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        else:
            print("WARNING: could not access camera")
            if fallback_filename:
                image = imread(fallback_filename)
    finally:
        camera.release()
    return image
In [ ]:
image = camera_grab(camera_id=0, fallback_filename='laptop.jpeg')
plt.imshow(image)
print("dtype: {}, shape: {}, range: {}".format(
image.dtype, image.shape, (image.min(), image.max())))
Write a function named classify that takes a snapshot of the webcam and displays it along with the decoded predictions of the model and their confidence levels.
If you don't have access to a webcam take a picture with your mobile phone or a photo of your choice from the web, store it as a JPEG file on the disk instead and pass that file to the neural network to make the prediction.
Try to classify a photo of your face. Look at the confidence level. Can you explain the results?
Try to classify photos of common objects such as a book, a mobile phone, a cup... Try to center the objects and remove clutter to get confidence higher than 0.5.
In [ ]:
def classify():
    # TODO: write me
    pass

classify()
In [ ]:
# %load solutions/classify_webcam.py
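For reference, here is a hedged sketch of what such a function could look like, reusing camera_grab and the preprocessing steps above; the actual solution file may differ:
In [ ]:
def classify():
    # Grab a frame from the webcam (or fall back to the laptop image).
    frame = camera_grab(camera_id=0, fallback_filename='laptop.jpeg')
    # Resize to (224, 224), keep the [0, 255] dynamic, then apply Keras preprocessing.
    frame_224 = resize(frame, (224, 224), mode='reflect', preserve_range=True)
    batch = preprocess_input(frame_224[np.newaxis].astype(np.float32))
    predictions = decode_predictions(model.predict(batch), top=3)[0]
    plt.imshow(frame)
    for _, class_name, confidence in predictions:
        print("{}: {:0.3f}".format(class_name, confidence))

classify()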
In [ ]:
import os.path as op
from zipfile import ZipFile
if not op.exists("images_resize"):
    print('Extracting image files...')
    zf = ZipFile('images_pascalVOC.zip')
    zf.extractall('.')
Let's build a new model that maps the image input space to the output of the layer before the last layer of the pretrained ResNet50 model. We call this new model the "base model":
In [ ]:
input = model.layers[0].input
output = model.layers[-2].output
base_model = Model(input, output)
base_model.output_shape
The base model can transform any image into a flat, high dimensional, semantic feature vector:
In [ ]:
# img_batch is the preprocessed batch of images built in the classification exercise above.
representation = base_model.predict(img_batch)
print("Shape of representation:", representation.shape)
Computing the representations of all images can be time consuming. This is usually done in large batches on a GPU for massive performance gains.
For the remaining part, we will use pre-computed representations saved in h5 format.
For those interested, this was done using the process_images.py script.
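As a rough idea of what such a script could look like, here is a sketch that computes the representations by batches and stores them in an HDF5 file; the batch size, variable names and output file name are assumptions, not the actual content of process_images.py:
In [ ]:
import os
import h5py

# Hypothetical sketch, not the actual process_images.py script.
image_paths = ["images_resize/" + fname
               for fname in sorted(os.listdir("images_resize/"))]
batch_size = 32
chunks = []
for start in range(0, len(image_paths), batch_size):
    batch = np.stack([
        resize(imread(p), (224, 224), mode='reflect', preserve_range=True)
        for p in image_paths[start:start + batch_size]
    ])
    chunks.append(base_model.predict(preprocess_input(batch.astype(np.float32))))

embeddings = np.vstack(chunks)
# Save under a different file name to avoid overwriting the provided img_emb.h5.
with h5py.File('img_emb_recomputed.h5', 'w') as h5f:
    h5f.create_dataset('img_emb', data=embeddings)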
In [ ]:
import os
paths = ["images_resize/" + path
for path in sorted(os.listdir("images_resize/"))]
In [ ]:
import h5py
with h5py.File('img_emb.h5', 'r') as h5f:
out_tensors = h5f['img_emb'][:]
out_tensors.shape
In [ ]:
out_tensors.dtype
Exercise
In [ ]:
# %load solutions/representations.py
Let's find a 2D representation of that high dimensional feature space using T-SNE:
In [ ]:
from sklearn.manifold import TSNE
img_emb_tsne = TSNE(perplexity=30).fit_transform(out_tensors)
In [ ]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
plt.scatter(img_emb_tsne[:, 0], img_emb_tsne[:, 1]);
plt.xticks(()); plt.yticks(());
plt.show()
Let's add thumbnails of the original images at their t-SNE locations:
In [ ]:
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from skimage.io import imread
from skimage.transform import resize
def imscatter(x, y, paths, ax=None, zoom=1, linewidth=0):
    if ax is None:
        ax = plt.gca()
    x, y = np.atleast_1d(x, y)
    artists = []
    for x0, y0, p in zip(x, y, paths):
        try:
            im = imread(p)
        except:
            print(p)
            continue
        im = resize(im, (224, 224), preserve_range=False, mode='reflect')
        im = OffsetImage(im, zoom=zoom)
        ab = AnnotationBbox(im, (x0, y0), xycoords='data',
                            frameon=True, pad=0.1,
                            bboxprops=dict(edgecolor='red',
                                           linewidth=linewidth))
        artists.append(ax.add_artist(ab))
    ax.update_datalim(np.column_stack([x, y]))
    ax.autoscale()
    return artists
In [ ]:
fig, ax = plt.subplots(figsize=(50, 50))
imscatter(img_emb_tsne[:, 0], img_emb_tsne[:, 1], paths, zoom=0.5, ax=ax)
plt.savefig('tsne.png')
In [ ]:
def display(img):
    plt.figure()
    img = imread(img)
    plt.imshow(img)
In [ ]:
idx = 57

def most_similar(idx, top_n=5):
    dists = np.linalg.norm(out_tensors - out_tensors[idx], axis=1)
    sorted_dists = np.argsort(dists)
    return sorted_dists[:top_n]

sim = most_similar(idx)
[display(paths[s]) for s in sim];
Using these representations, it may be possible to build a nearest neighbor classifier. However, the representations are learnt on ImageNet, which consists of centered images, whereas here we input images from PascalVOC, which are more plausible inputs for a real world system.
The next section explores this possibility by computing the histogram of similarities between one image and the others.
In [ ]:
# Normalize the representations to unit length to compute cosine similarities.
out_norms = np.linalg.norm(out_tensors, axis=1, keepdims=True)
normed_out_tensors = out_tensors / out_norms
In [ ]:
item_idx = 208
dists_to_item = np.linalg.norm(out_tensors - out_tensors[item_idx],
axis=1)
cos_to_item = np.dot(normed_out_tensors, normed_out_tensors[item_idx])
plt.hist(cos_to_item, bins=30)
display(paths[item_idx])
Unfortunately there is no clear separation of class boundaries visible in the histogram of similarities alone. We need some supervision to be able to classify images.
With a labeled dataset, even with very few labels per class, one would be able to build a simple classifier (for instance a nearest neighbor model) on top of these representations.
These approximate classifiers are useful in practice.
See the cat vs dog home assignment with GPU for another example of this approach.
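The following sketch illustrates this idea with a nearest neighbor classifier fit on the normalized representations; the labeled indices and class names are made up purely for illustration:
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical hand-labeled subset: the indices and class names below are made up.
labeled_idx = [0, 1, 57, 208]
labeled_classes = ["cat", "cat", "car", "dog"]

knn = KNeighborsClassifier(n_neighbors=1, metric='cosine')
knn.fit(normed_out_tensors[labeled_idx], labeled_classes)

# Assign each image in the collection to the class of its nearest labeled neighbor.
predicted_classes = knn.predict(normed_out_tensors)
print(predicted_classes[:10])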
In [ ]:
items = np.where(cos_to_item > 0.5)
print(items)
[display(paths[s]) for s in items[0]];
In [ ]: