In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from PIL import ImageDraw
from PIL import ImageColor
import time
from scipy.stats import norm
%matplotlib inline
plt.style.use('ggplot')
MobileNets, as the name suggests, are neural networks constructed for the purpose of running very efficiently (high FPS, low memory footprint) on mobile and embedded devices. MobileNets achieve this with 3 techniques:
1. Depthwise separable convolutions
2. Width multiplier
3. Resolution multiplier
These 3 techniques reduce the cumulative number of parameters and therefore the computation required. Of course, models with more parameters generally achieve higher accuracy. MobileNets are no silver bullet: while they perform very well, larger models will outperform them. MobileNets are designed for mobile devices, NOT cloud GPUs. The reason we're using them in this lab is that automotive hardware is closer to mobile or embedded devices than to beefy cloud GPUs.
Before we get into the MobileNet convolution block, let's take a step back and recall the computational cost of a vanilla convolution. There are $N$ kernels of size $D_k * D_k$. Each of these kernels goes over the entire input, which is a $D_f * D_f * M$ sized feature map or tensor (if that makes more sense). The computational cost is:
$$ D_f * D_f * M * N * D_k * D_k $$

Let $D_g * D_g$ be the size of the output feature map. Then a standard convolution takes in a $D_f * D_f * M$ input feature map and returns a $D_g * D_g * N$ feature map as output.
A depthwise convolution acts on each input channel separately with a different kernel. $M$ input channels implies there are $M$ $D_k * D_k$ kernels. Also notice this results in $N$ being set to 1. If this doesn't make sense, think about the shape a kernel would have to be to act upon an individual channel.
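To make the "$N$ set to 1" point concrete, here's a minimal sketch (TF 1.x; the shapes are arbitrary illustration values, not anything from this lab) showing that a depthwise convolution returns a feature map with the same number of channels it received:

# 1 image, 8x8 spatial, M = 3 input channels
x = tf.placeholder(tf.float32, [1, 8, 8, 3])

# One 3x3 kernel per input channel: [D_k, D_k, M, 1].
# The trailing 1 is the channel multiplier -- the "N = 1" in the text.
f = tf.Variable(tf.truncated_normal([3, 3, 3, 1]))

y = tf.nn.depthwise_conv2d(x, f, strides=[1, 1, 1, 1], padding='SAME')
print(y.get_shape())  # (1, 8, 8, 3) -- channel count unchanged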
Computation cost:
$$ D_f * D_f * M * D_k * D_k $$

A pointwise convolution performs a 1x1 convolution. It's the same as a vanilla convolution except the kernel size is $1 * 1$.
Computation cost:
$$ D_k * D_k * D_f * D_f * M * N = 1 * 1 * D_f * D_f * M * N = D_f * D_f * M * N $$

Thus the total computational cost of a depthwise separable convolution is:
$$ D_f * D_f * M * D_k * D_k + D_f * D_f * M * N $$

which results in a $\frac{1}{N} + \frac{1}{D_k^2}$ reduction in computation:
$$ \frac {D_f * D_f * M * D_k * D_k + D_f * D_f * M * N} {D_f * D_f * M * N * D_k * D_k} = \frac {D_k^2 + N} {D_k^2*N} = \frac {1}{N} + \frac{1}{D_k^2} $$

MobileNets use a 3x3 kernel, so assuming a large enough $N$, depthwise separable convolutions are ~9x more computationally efficient than vanilla convolutions!
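As a sanity check, here's a quick sketch plugging the constants used later in this notebook ($D_k = 3$, $N = 512$) into the ratio above:

D_k, N = 3, 512

reduction = 1 / N + 1 / D_k**2
print("cost ratio = {:.4f} (~{:.1f}x speedup)".format(reduction, 1 / reduction))
# cost ratio = 0.1131 (~8.8x speedup)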
The 2nd technique for reducing the computational cost is the "width multiplier" which is a hyperparameter inhabiting the range [0, 1] denoted here as $\alpha$. $\alpha$ reduces the number of input and output channels proportionally:
$$ D_f * D_f * \alpha M * D_k * D_k + D_f * D_f * \alpha M * \alpha N $$

The 3rd technique for reducing the computational cost is the "resolution multiplier", which is a hyperparameter inhabiting the range [0, 1] denoted here as $\rho$. $\rho$ reduces the size of the input feature map:
$$ \rho D_f * \rho D_f * M * D_k * D_k + \rho D_f * \rho D_f * M * N $$

Combining the width and resolution multipliers results in a computational cost of:
$$ \rho D_f * \rho D_f * \alpha M * D_k * D_k + \rho D_f * \rho D_f * \alpha M * \alpha N $$

Training MobileNets with different values of $\alpha$ and $\rho$ will result in different speed vs. accuracy tradeoffs. The folks at Google have run these experiments; the results are shown in the graphic below:
MACs (M) represents the number of multiplication-add operations in the millions.
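To get a feel for how $\alpha$ and $\rho$ trade compute away, here's a small sketch that plugs a few multiplier settings into the combined formula above (the feature map and channel sizes are illustrative assumptions, not values from the graphic):

def separable_cost(D_f, M, N, D_k=3, alpha=1.0, rho=1.0):
    """Multiply-adds for one depthwise separable conv with width/resolution multipliers."""
    depthwise = (rho * D_f)**2 * (alpha * M) * D_k**2
    pointwise = (rho * D_f)**2 * (alpha * M) * (alpha * N)
    return depthwise + pointwise

for alpha in (1.0, 0.75, 0.5):
    for rho in (1.0, 0.5):
        cost = separable_cost(D_f=224, M=32, N=64, alpha=alpha, rho=rho)
        print("alpha={:.2f} rho={:.2f} -> {:,.0f} mult-adds".format(alpha, rho, cost))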
In [2]:
def vanilla_conv_block(x, kernel_size, output_channels):
    """
    Vanilla Conv -> Batch Norm -> ReLU
    """
    x = tf.layers.conv2d(
        x, output_channels, kernel_size, (2, 2), padding='SAME')
    x = tf.layers.batch_normalization(x)
    return tf.nn.relu(x)
# TODO: implement MobileNet conv block
def mobilenet_conv_block(x, kernel_size, output_channels):
    """
    Depthwise Conv -> Batch Norm -> ReLU -> Pointwise Conv -> Batch Norm -> ReLU
    """
    # The depthwise filter shape must be [kernel_size, kernel_size, in_channels, 1],
    # and it must be a trainable Variable (not a random constant) so its weights
    # are counted as parameters below.
    in_channels = x.get_shape().as_list()[-1]
    f = tf.Variable(tf.truncated_normal([kernel_size, kernel_size, in_channels, 1]))
    # The depthwise conv carries the (2, 2) stride so the output size matches
    # the vanilla block.
    x = tf.nn.depthwise_conv2d(x, f, strides=[1, 2, 2, 1], padding='SAME')
    x = tf.layers.batch_normalization(x)
    x = tf.nn.relu(x)
    # The pointwise (1x1) conv maps in_channels -> output_channels.
    x = tf.layers.conv2d(x, output_channels, 1, (1, 1), padding='SAME')
    x = tf.layers.batch_normalization(x)
    return tf.nn.relu(x)
Let's compare the number of parameters in each block.
In [3]:
# constants but you can change them so I guess they're not so constant :)
INPUT_CHANNELS = 32
OUTPUT_CHANNELS = 512
KERNEL_SIZE = 3
IMG_HEIGHT = 256
IMG_WIDTH = 256
with tf.Session(graph=tf.Graph()) as sess:
    # input
    x = tf.constant(np.random.randn(1, IMG_HEIGHT, IMG_WIDTH, INPUT_CHANNELS), dtype=tf.float32)

    with tf.variable_scope('vanilla'):
        vanilla_conv = vanilla_conv_block(x, KERNEL_SIZE, OUTPUT_CHANNELS)
    with tf.variable_scope('mobile'):
        mobilenet_conv = mobilenet_conv_block(x, KERNEL_SIZE, OUTPUT_CHANNELS)

    vanilla_params = [
        (v.name, np.prod(v.get_shape().as_list()))
        for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'vanilla')
    ]
    mobile_params = [
        (v.name, np.prod(v.get_shape().as_list()))
        for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'mobile')
    ]

    print("VANILLA CONV BLOCK")
    total_vanilla_params = sum([p[1] for p in vanilla_params])
    for p in vanilla_params:
        print("Variable {0}: number of params = {1}".format(p[0], p[1]))
    print("Total number of params =", total_vanilla_params)
    print()

    print("MOBILENET CONV BLOCK")
    total_mobile_params = sum([p[1] for p in mobile_params])
    for p in mobile_params:
        print("Variable {0}: number of params = {1}".format(p[0], p[1]))
    print("Total number of params =", total_mobile_params)
    print()

    print("{0:.3f}x parameter reduction".format(total_vanilla_params /
                                                total_mobile_params))
Your solution should show that the majority of the parameters in the MobileNet block stem from the pointwise convolution.
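As a back-of-the-envelope check (ignoring biases and batch norm parameters, and using the constants above), the weight counts work out as follows:

# Depthwise: one 3x3 kernel per input channel
depthwise_weights = 3 * 3 * 32          # 288
# Pointwise: 1x1 kernels mapping 32 -> 512 channels
pointwise_weights = 1 * 1 * 32 * 512    # 16,384
# Vanilla: 3x3 kernels mapping 32 -> 512 channels
vanilla_weights = 3 * 3 * 32 * 512      # 147,456

print(pointwise_weights / (depthwise_weights + pointwise_weights))  # ~0.98 of the block
print(vanilla_weights / (depthwise_weights + pointwise_weights))    # ~8.8x fewer weights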
In this section you'll use a pretrained MobileNet SSD model to perform object detection. You can download the MobileNet SSD and other models from the TensorFlow detection model zoo. There is also a paper comparing several object detection models.
Alright, let's get into SSD!
Many previous works in object detection involve more than one training phase. For example, the Faster-RCNN architecture first trains a Region Proposal Network (RPN), which decides which regions of the image are worth drawing a box around. The RPN is then merged with a pretrained model for classification (it classifies the regions). The image below shows an RPN:
The SSD architecture is a single convolutional network which learns to predict bounding box locations and classify the locations in one pass. Put differently, SSD can be trained end to end while Faster-RCNN cannot. The SSD architecture consists of a base network followed by several convolutional layers:
NOTE: In this lab the base network is a MobileNet (instead of VGG16.)
SSD operates on feature maps to predict bounding box locations. Recall a feature map is of size $D_f * D_f * M$. For each feature map location, $k$ bounding boxes are predicted. Each bounding box carries with it the following information:
1. 4 corner bounding box offsets
2. $C$ class probabilities ($c_1, c_2, ..., c_C$)
SSD does not predict the shape of the box, just where the box is. The $k$ bounding boxes each have a predetermined shape. This is illustrated in the figure below:
The shapes are set prior to actual training. For example, in figure (c) above there are 4 boxes, meaning $k = 4$.
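Concretely, each location in a $D_f * D_f$ feature map predicts $k$ boxes, and each box needs 4 offsets plus $C$ class scores, so a detection head is just a convolution with $k * (C + 4)$ output channels. A minimal sketch (TF 1.x; the feature map size and class count are illustrative assumptions):

num_classes = 90   # e.g. COCO
k = 4              # default boxes per feature map location

feature_map = tf.placeholder(tf.float32, [1, 19, 19, 512])

# One 3x3 conv produces every box offset and class score for this scale.
predictions = tf.layers.conv2d(feature_map, k * (num_classes + 4), 3, padding='SAME')
print(predictions.get_shape())  # (1, 19, 19, 376), since 4 * (90 + 4) = 376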
Q: Why does SSD use several differently sized feature maps?
A: This is done so the network can recognize object classes at different scales. Each feature map scale is trained on all classes with augmented data at all supported resolution scales (1x, 2x, 3x, 1/2x, 1/3x). This helps improve the overall accuracy of the model.
Q: How are the default boxes matched to the ground truth boxes during training?
A: This is done in two ways:
1. Each ground truth box is matched to the default box with the highest Jaccard overlap (IoU).
2. Default boxes are additionally matched to any ground truth box they overlap with a Jaccard overlap above a threshold (0.5 in the SSD paper).
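Jaccard overlap (intersection over union) is straightforward to compute directly; here's a small NumPy sketch, assuming boxes in (ymin, xmin, ymax, xmax) form like the model outputs later in this lab:

def jaccard_overlap(box_a, box_b):
    """IoU of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    intersection = max(0, ymax - ymin) * max(0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(jaccard_overlap((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.143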
With the final set of matched boxes we can compute the loss:
$$ L = \frac {1} {N} * ( L_{class} + L_{box}) $$

where $N$ is the total number of matched boxes, $L_{class}$ is a softmax loss for classification, and $L_{box}$ is an L1 smooth loss representing the error between the matched boxes and the ground truth boxes. L1 smooth loss is a modification of L1 loss which is more robust to outliers. In the event $N$ is 0 the loss is set to 0.
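For reference, here's a minimal NumPy sketch of the smooth L1 function: it behaves like squared error near zero (so it's smooth there) and like L1 for large residuals (so outliers contribute less):

def smooth_l1(x):
    """Elementwise smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1, 0.5 * x**2, abs_x - 0.5)

print(smooth_l1(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))
# [2.5   0.125 0.    0.125 2.5  ]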
In this part of the lab you'll detect objects using pretrained object detection models. You can download the pretrained models from the model zoo.
In [4]:
# Frozen inference graph files. NOTE: change the path to where you saved the models.
SSD_GRAPH_FILE = 'ssd_mobilenet_v1_coco_2017_11_17/frozen_inference_graph.pb'
RFCN_GRAPH_FILE = 'rfcn_resnet101_coco_2018_01_28/frozen_inference_graph.pb'
FASTER_RCNN_GRAPH_FILE = 'faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/frozen_inference_graph.pb'
Below are utility functions. The main purpose of these is to draw the bounding boxes back onto the original image.
In [5]:
# Colors (one for each class)
cmap = ImageColor.colormap
print("Number of colors =", len(cmap))
COLOR_LIST = sorted([c for c in cmap.keys()])
#
# Utility funcs
#
def filter_boxes(min_score, boxes, scores, classes):
    """Return boxes with a confidence >= `min_score`"""
    n = len(classes)
    idxs = []
    for i in range(n):
        if scores[i] >= min_score:
            idxs.append(i)

    filtered_boxes = boxes[idxs, ...]
    filtered_scores = scores[idxs, ...]
    filtered_classes = classes[idxs, ...]
    return filtered_boxes, filtered_scores, filtered_classes
def to_image_coords(boxes, height, width):
    """
    The original box coordinate output is normalized, i.e. [0, 1].

    This converts it back to the original coordinate based on the image
    size.
    """
    box_coords = np.zeros_like(boxes)
    box_coords[:, 0] = boxes[:, 0] * height
    box_coords[:, 1] = boxes[:, 1] * width
    box_coords[:, 2] = boxes[:, 2] * height
    box_coords[:, 3] = boxes[:, 3] * width
    return box_coords
def draw_boxes(image, boxes, classes, thickness=4):
    """Draw bounding boxes on the image"""
    draw = ImageDraw.Draw(image)
    for i in range(len(boxes)):
        bot, left, top, right = boxes[i, ...]
        class_id = int(classes[i])
        color = COLOR_LIST[class_id]
        draw.line([(left, top), (left, bot), (right, bot), (right, top), (left, top)], width=thickness, fill=color)
def load_graph(graph_file):
    """Loads a frozen inference graph"""
    graph = tf.Graph()
    with graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(graph_file, 'rb') as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')
    return graph
Below we load the graph and extract the relevant tensors using `get_tensor_by_name`. These tensors reflect the input and outputs of the graph, or at least the ones we care about for detecting objects.
In [6]:
detection_graph = load_graph(SSD_GRAPH_FILE)
# detection_graph = load_graph(RFCN_GRAPH_FILE)
# detection_graph = load_graph(FASTER_RCNN_GRAPH_FILE)
# The input placeholder for the image.
# `get_tensor_by_name` returns the Tensor with the associated name in the Graph.
image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
# Each box represents a part of the image where a particular object was detected.
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
# Each score represents the level of confidence for each of the objects.
# Score is shown on the result image, together with the class label.
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
# The classification of the object (integer id).
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
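If you ever need to discover the tensor names in a different frozen graph, one option is to list the graph's operations (a tensor name is just the op name plus an output index, e.g. ':0'); a quick sketch:

# Peek at the first few operation names in the frozen graph to find
# candidate input/output tensors.
for op in detection_graph.get_operations()[:10]:
    print(op.name)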
Run detection and classification on a sample image.
In [7]:
# Load a sample image.
image = Image.open('./assets/sample1.jpg')
image_np = np.expand_dims(np.asarray(image, dtype=np.uint8), 0)

with tf.Session(graph=detection_graph) as sess:
    # Actual detection.
    (boxes, scores, classes) = sess.run([detection_boxes, detection_scores, detection_classes],
                                        feed_dict={image_tensor: image_np})

    # Remove unnecessary dimensions
    boxes = np.squeeze(boxes)
    scores = np.squeeze(scores)
    classes = np.squeeze(classes)

    confidence_cutoff = 0.8
    # Filter boxes with a confidence score less than `confidence_cutoff`
    boxes, scores, classes = filter_boxes(confidence_cutoff, boxes, scores, classes)

    # The current box coordinates are normalized to a range between 0 and 1.
    # This converts the coordinates to their actual location on the image.
    width, height = image.size
    box_coords = to_image_coords(boxes, height, width)

    # Each class will be represented by a differently colored box
    draw_boxes(image, box_coords, classes)

plt.figure(figsize=(12, 8))
plt.imshow(image)
In [8]:
def time_detection(sess, img_height, img_width, runs=10):
    image_tensor = sess.graph.get_tensor_by_name('image_tensor:0')
    detection_boxes = sess.graph.get_tensor_by_name('detection_boxes:0')
    detection_scores = sess.graph.get_tensor_by_name('detection_scores:0')
    detection_classes = sess.graph.get_tensor_by_name('detection_classes:0')

    # warmup
    gen_image = np.uint8(np.random.randn(1, img_height, img_width, 3))
    sess.run([detection_boxes, detection_scores, detection_classes], feed_dict={image_tensor: gen_image})

    times = np.zeros(runs)
    for i in range(runs):
        t0 = time.time()
        # Time the generated image so the requested height/width are honored.
        sess.run([detection_boxes, detection_scores, detection_classes], feed_dict={image_tensor: gen_image})
        t1 = time.time()
        times[i] = (t1 - t0) * 1000
    return times
In [9]:
with tf.Session(graph=detection_graph) as sess:
    times = time_detection(sess, 600, 1000, runs=10)
In [10]:
# Create a figure instance
fig = plt.figure(1, figsize=(9, 6))
# Create an axes instance
ax = fig.add_subplot(111)
plt.title("Object Detection Timings")
plt.ylabel("Time (ms)")
# Create the boxplot
plt.style.use('fivethirtyeight')
bp = ax.boxplot(times)
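The boxplot is handy for spotting variance, but a few summary numbers make model-to-model comparison easier; a quick sketch:

print("median: {:.1f} ms, mean: {:.1f} ms, std: {:.1f} ms".format(
    np.median(times), np.mean(times), np.std(times)))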
Download a few models from the model zoo and compare the timings.
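A minimal sketch of the comparison, reusing `load_graph` and `time_detection` with the graph file paths defined earlier (this assumes you've downloaded all three models to those paths):

graph_files = {
    'ssd_mobilenet': SSD_GRAPH_FILE,
    'rfcn_resnet101': RFCN_GRAPH_FILE,
    'faster_rcnn': FASTER_RCNN_GRAPH_FILE,
}

for name, graph_file in graph_files.items():
    graph = load_graph(graph_file)
    with tf.Session(graph=graph) as sess:
        model_times = time_detection(sess, 600, 1000, runs=10)
    print("{}: median {:.1f} ms".format(name, np.median(model_times)))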
Finally run your pipeline on this short video.
In [11]:
# Import everything needed to edit/save/watch video clips
from moviepy.editor import VideoFileClip
from IPython.display import HTML
In [12]:
HTML("""
<video width="960" height="600" controls>
<source src="{0}" type="video/mp4">
</video>
""".format('driving.mp4'))
Out[12]:
In [13]:
clip = VideoFileClip('driving.mp4')
In [29]:
# TODO: Complete this function.
# The input is a NumPy array.
# The output should also be a NumPy array.
def pipeline(img):
    draw_img = Image.fromarray(img)
    image_np = np.expand_dims(np.asarray(img, dtype=np.uint8), 0)

    # Actual detection.
    (boxes, scores, classes) = sess.run([detection_boxes, detection_scores, detection_classes],
                                        feed_dict={image_tensor: image_np})

    # Remove unnecessary dimensions
    boxes = np.squeeze(boxes)
    scores = np.squeeze(scores)
    classes = np.squeeze(classes)

    confidence_cutoff = 0.8
    # Filter boxes with a confidence score less than `confidence_cutoff`
    boxes, scores, classes = filter_boxes(confidence_cutoff, boxes, scores, classes)

    # The current box coordinates are normalized to a range between 0 and 1.
    # This converts the coordinates to their actual location on the image.
    width, height = draw_img.size
    box_coords = to_image_coords(boxes, height, width)

    # Each class will be represented by a differently colored box
    draw_boxes(draw_img, box_coords, classes)
    return np.array(draw_img)
In [30]:
with tf.Session(graph=detection_graph) as sess:
    image_tensor = sess.graph.get_tensor_by_name('image_tensor:0')
    detection_boxes = sess.graph.get_tensor_by_name('detection_boxes:0')
    detection_scores = sess.graph.get_tensor_by_name('detection_scores:0')
    detection_classes = sess.graph.get_tensor_by_name('detection_classes:0')

    new_clip = clip.fl_image(pipeline)

    # write to file
    new_clip.write_videofile('result.mp4')
In [31]:
HTML("""
<video width="960" height="600" controls>
<source src="{0}" type="video/mp4">
</video>
""".format('result.mp4'))
Out[31]:
Some ideas to take things further:
In [ ]: