Fully Convolutional Neural Networks

Objectives:

  • Load a CNN model pre-trained on ImageNet
  • Transform the network into a Fully Convolutional Network
  • Apply the network to perform weak segmentation on images

In [1]:
%matplotlib inline
import warnings

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

In [2]:
# Load a pre-trained ResNet50
# We use include_top = False for now,
# as we'll load the output Dense layer weights later

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
base_model = ResNet50(include_top=False)

print(base_model.output_shape)


(None, None, None, 2048)

In [3]:
#print(base_model.summary())

In [4]:
res5c = base_model.layers[-1]
type(res5c)


Out[4]:
tensorflow.python.keras.layers.core.Activation

In [5]:
res5c.output_shape


Out[5]:
(None, None, None, 2048)

Fully convolutional ResNet

  • Out of the res5c residual block, ResNet outputs a tensor of shape $W \times H \times 2048$.
  • For the default ImageNet input size, $224 \times 224$, the output size is $7 \times 7 \times 2048$ (see the quick check below).
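
As a quick sanity check (a minimal sketch, reusing the base_model loaded above), we can feed a random 224×224 image and verify the ×32 spatial reduction:

dummy_input = np.random.randn(1, 224, 224, 3).astype("float32")
print(base_model(dummy_input).numpy().shape)  # expected: (1, 7, 7, 2048), i.e. 224 / 32 = 7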

Regular ResNet layers

The regular ResNet head after the base model is as follows:

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1000)(x)
x = Softmax()(x)

Here is the full definition of the model: https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
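
Note that GlobalAveragePooling2D collapses the spatial dimensions, which is why this regular head cannot produce per-location predictions. A minimal illustration, using a random tensor as a stand-in for the real ResNet feature map:

from tensorflow.keras.layers import GlobalAveragePooling2D

fake_features = tf.random.normal((1, 7, 7, 2048))
pooled = GlobalAveragePooling2D()(fake_features)
print(pooled.shape)  # (1, 2048): the 7x7 grid is averaged away, spatial information is lost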

Our Version

  • We want to retrieve the label information, which is stored in the weights of the Dense layer. We will load these weights afterwards.
  • We will replace the Dense layer with a Convolution2D layer to keep spatial information, so as to output a $W \times H \times 1000$ tensor.
  • We can use a kernel size of (1, 1) for that new Convolution2D layer so that it passes the spatial organization of the previous layer through unchanged (this is called a pointwise convolution); see the sketch after this list.
  • We want to apply a softmax only on the last dimension so as to preserve the $W \times H$ spatial information.
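
To see why a (1, 1) convolution behaves like a Dense layer applied independently at every spatial location, here is a small sketch (the tensor shape and variable names are illustrative, not part of the exercise):

from tensorflow.keras.layers import Convolution2D, Dense

x_small = tf.random.normal((1, 7, 7, 2048))
pointwise = Convolution2D(1000, (1, 1))
dense = Dense(1000)

y_conv = pointwise(x_small)  # (1, 7, 7, 1000), also builds the conv layer
_ = dense(x_small)           # builds the dense layer (Dense acts on the last axis)

# Copy the conv kernel (1, 1, 2048, 1000) into the dense kernel (2048, 1000)
dense.set_weights([pointwise.get_weights()[0].reshape(2048, 1000),
                   pointwise.get_weights()[1]])
print(np.allclose(y_conv.numpy(), dense(x_small).numpy(), atol=1e-4))  # expected: True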

A custom Softmax

We build the following Custom Layer to apply a softmax only to the last dimension of a tensor:


In [6]:
from tensorflow.keras import layers

# A custom layer in Keras must implement the four following methods:
class SoftmaxMap(layers.Layer):
    # Init function
    def __init__(self, axis=-1, **kwargs):
        self.axis = axis
        super(SoftmaxMap, self).__init__(**kwargs)

    # There's no parameter, so we don't need this one
    def build(self, input_shape):
        pass

    # This is the method we're interested in:
    # it is very similar to the regular softmax, but note that
    # we accept x.shape == (batch_size, w, h, n_classes),
    # which is not what the default Keras softmax handles.
    # Note also that we subtract the maximum of the logits to
    # make the softmax numerically stable.
    def call(self, x, mask=None):
        e = tf.exp(x - tf.math.reduce_max(x, axis=self.axis, keepdims=True))
        s = tf.math.reduce_sum(e, axis=self.axis, keepdims=True)
        return e / s

    # The output shape is the same as the input shape
    def compute_output_shape(self, input_shape):
        return input_shape
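
As a side note, the same normalization can be obtained with the built-in tf.nn.softmax, which already accepts an axis argument; the custom layer above mainly makes the operation explicit. A quick check, assuming the SoftmaxMap class defined above:

x_check = tf.random.normal((2, 3, 4, 5))
print(np.allclose(SoftmaxMap()(x_check).numpy(),
                  tf.nn.softmax(x_check, axis=-1).numpy(),
                  atol=1e-6))  # expected: True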

Let's check that we can use this layer to normalize the class probabilities of some random spatial predictions:


In [7]:
n_samples, w, h, n_classes = 10, 3, 4, 5
random_data = np.random.randn(n_samples, w, h, n_classes).astype("float32")
random_data.shape


Out[7]:
(10, 3, 4, 5)

Because those predictions are random, if we sum across the classes dimension we get random values instead of class probabilities that would sum to 1:


In [8]:
random_data[0].sum(axis=-1)


Out[8]:
array([[ 0.27685624, -1.248265  , -0.17073488, -1.5251491 ],
       [ 2.3490455 , -1.4799201 , -3.292039  ,  1.5062871 ],
       [-0.08336051,  2.3915372 , -1.7528347 ,  3.5274372 ]],
      dtype=float32)

Let's instantiate a SoftmaxMap layer and use it to process our test data:


In [9]:
softmaxMap = SoftmaxMap()
softmax_mapped_data = softmaxMap(random_data).numpy()
softmax_mapped_data.shape


Out[9]:
(10, 3, 4, 5)

All the values are now in the [0, 1] range:


In [10]:
softmax_mapped_data[0]


Out[10]:
array([[[0.56862915, 0.06077185, 0.06606976, 0.0383178 , 0.2662114 ],
        [0.01185074, 0.6777243 , 0.05529718, 0.16287254, 0.09225518],
        [0.48184162, 0.01423021, 0.08089   , 0.07605473, 0.3469834 ],
        [0.07524367, 0.19022258, 0.09395018, 0.23576528, 0.4048183 ]],

       [[0.03309341, 0.31252265, 0.24507003, 0.16442387, 0.24489002],
        [0.11887496, 0.20828551, 0.09239109, 0.180172  , 0.40027645],
        [0.19135876, 0.25699514, 0.19221951, 0.16412118, 0.19530542],
        [0.09936516, 0.03292248, 0.12721542, 0.5291481 , 0.21134889]],

       [[0.1007165 , 0.05022516, 0.05779989, 0.66288376, 0.12837464],
        [0.04103367, 0.09390692, 0.6337432 , 0.08749201, 0.1438242 ],
        [0.34717003, 0.18080053, 0.08203337, 0.18132693, 0.20866917],
        [0.17282423, 0.22243509, 0.24389896, 0.1278988 , 0.23294285]]],
      dtype=float32)

The values along the last dimension now approximately sum to one, so they can be used as class probabilities (or parameters of a multinoulli distribution):


In [11]:
softmax_mapped_data[0].sum(axis=-1)


Out[11]:
array([[0.99999994, 0.99999994, 1.        , 1.        ],
       [1.        , 1.        , 0.99999994, 1.        ],
       [0.99999994, 1.        , 1.        , 0.9999999 ]], dtype=float32)

Note that the highest activated channel for each spatial location is still the same before and after the softmax map. The ranking of the activations is preserved as softmax is a monotonic function (when considered element-wise):


In [12]:
random_data[0].argmax(axis=-1)


Out[12]:
array([[0, 1, 0, 4],
       [1, 4, 1, 3],
       [3, 2, 0, 2]])

In [13]:
softmax_mapped_data[0].argmax(axis=-1)


Out[13]:
array([[0, 1, 0, 4],
       [1, 4, 1, 3],
       [3, 2, 0, 2]])

Exercise

  • What is the shape of the convolution kernel we want to apply to replace the Dense layer?
  • Build the fully convolutional model as described above. We want the output to preserve the spatial dimensions but output 1000 channels (one channel per class).
  • You may introspect the last elements of base_model.layers to find which layer to remove
  • You may use the Keras Convolution2D(output_channels, (filter_w, filter_h)) layer and our SoftmaxMap to normalize the result as per-class probabilities.
  • For now, ignore the weights of the new layer(s) (leave them initialized at random): just focus on making the right architecture with the right output shape.

In [14]:
from tensorflow.keras.layers import Convolution2D
from tensorflow.keras.models import Model

input = base_model.layers[0].input

# TODO: compute per-area class probabilities
output = input

fully_conv_ResNet = Model(inputs=input, outputs=output)

In [15]:
# %load solutions/fully_conv.py
from tensorflow.keras.layers import Convolution2D
from tensorflow.keras.models import Model

input = base_model.layers[0].input

# Take the output of the last layer of the convnet:
x = base_model.layers[-1].output

# A 1x1 convolution, with 1000 output channels, one per class
x = Convolution2D(1000, (1, 1), name='conv1000')(x)

# Softmax on last axis of tensor to normalize the class
# predictions in each spatial area
output = SoftmaxMap(axis=-1)(x)

fully_conv_ResNet = Model(inputs=input, outputs=output)

# A 1x1 convolution applies a Dense to each spatial grid location

You can use the following random data to check that it's possible to run a forward pass on a random RGB image:


In [16]:
prediction_maps = fully_conv_ResNet(np.random.randn(1, 200, 300, 3)).numpy()
prediction_maps.shape


WARNING:tensorflow:Layer conv1_pad is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Out[16]:
(1, 7, 10, 1000)

How do you explain the resulting output shape?

The class probabilities should sum to one in each area of the output map:


In [17]:
prediction_maps.sum(axis=-1)


Out[17]:
array([[[1.        , 1.0000001 , 1.        , 1.        , 1.        ,
         1.        , 1.        , 1.        , 0.9999999 , 1.        ],
        [1.        , 0.99999994, 1.        , 0.9999999 , 0.9999999 ,
         0.9999999 , 1.0000001 , 1.        , 1.        , 1.        ],
        [0.99999994, 1.        , 1.        , 1.        , 0.9999999 ,
         0.99999994, 1.0000001 , 0.9999999 , 1.        , 1.        ],
        [1.        , 0.99999994, 0.99999994, 1.        , 1.0000002 ,
         1.        , 0.99999994, 0.9999999 , 1.0000002 , 1.0000001 ],
        [1.        , 0.99999994, 1.0000001 , 0.9999999 , 1.        ,
         1.0000001 , 1.        , 1.0000001 , 0.99999994, 1.        ],
        [1.        , 1.        , 1.        , 1.        , 1.        ,
         1.        , 0.99999994, 1.        , 1.        , 1.        ],
        [1.        , 1.        , 0.9999999 , 1.        , 1.        ,
         1.        , 1.        , 0.9999999 , 0.99999994, 0.99999994]]],
      dtype=float32)

Loading Dense weights

  • We provide the weights and bias of the last Dense layer of ResNet50 in file weights_dense.h5
  • Our last layer is now a 1x1 convolutional layer instead of a fully connected layer

In [18]:
import h5py

with h5py.File('weights_dense.h5', 'r') as h5f:
    w = h5f['w'][:]
    b = h5f['b'][:]

In [19]:
last_layer = fully_conv_ResNet.layers[-2]

print("Loaded weight shape:", w.shape)
print("Last conv layer weights shape:", last_layer.get_weights()[0].shape)


Loaded weight shape: (2048, 1000)
Last conv layer weights shape: (1, 1, 2048, 1000)

In [20]:
# reshape the weights
w_reshaped = w.reshape((1, 1, 2048, 1000))

# set the conv layer weights
last_layer.set_weights([w_reshaped, b])
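
A quick sanity check (only verifying that the reshaped weights actually landed in the conv layer):

w_loaded, b_loaded = last_layer.get_weights()
print(np.allclose(w_loaded.reshape(2048, 1000), w))  # expected: True
print(np.allclose(b_loaded, b))                      # expected: True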

A forward pass

  • We define the following function to test our new network.
  • It resizes the input image to a given size, then runs a forward pass through fully_conv_ResNet to compute the output prediction map

In [21]:
from tensorflow.keras.applications.imagenet_utils import preprocess_input
from skimage.io import imread
from skimage.transform import resize

def forward_pass_resize(img_path, img_size):
    img_raw = imread(img_path)
    print("Image shape before resizing: %s" % (img_raw.shape,))
    img = resize(img_raw, img_size, mode='reflect', preserve_range=True)
    img = preprocess_input(img[np.newaxis])
    print("Image batch size shape before forward pass:", img.shape)
    prediction_map = fully_conv_ResNet(img).numpy()
    return prediction_map

In [22]:
output = forward_pass_resize("dog.jpg", (800, 600))
print("prediction map shape", output.shape)


Image shape before resizing: (1600, 2560, 3)
Image batch size shape before forward pass: (1, 800, 600, 3)
prediction map shape (1, 25, 19, 1000)

ImageNet uses an ontology of concepts, from which classes are derived. A synset corresponds to a node in the ontology.

For example all species of dogs are children of the synset n02084071 (Dog, domestic dog, Canis familiaris):


In [23]:
# Helper file for importing synsets from imagenet
import imagenet_tool
synset = "n02084071" # synset corresponding to dogs
ids = imagenet_tool.synset_to_dfs_ids(synset)
print("All dog classes ids (%d):" % len(ids))
print(ids)


All dog classes ids (118):
[251, 268, 256, 253, 255, 254, 257, 159, 211, 210, 212, 214, 213, 216, 215, 219, 220, 221, 217, 218, 207, 209, 206, 205, 208, 193, 202, 194, 191, 204, 187, 203, 185, 192, 183, 199, 195, 181, 184, 201, 186, 200, 182, 188, 189, 190, 197, 196, 198, 179, 180, 177, 178, 175, 163, 174, 176, 160, 162, 161, 164, 168, 173, 170, 169, 165, 166, 167, 172, 171, 264, 263, 266, 265, 267, 262, 246, 242, 243, 248, 247, 229, 233, 234, 228, 231, 232, 230, 227, 226, 235, 225, 224, 223, 222, 236, 252, 237, 250, 249, 241, 239, 238, 240, 244, 245, 259, 261, 260, 258, 154, 153, 158, 152, 155, 151, 157, 156]

In [24]:
for dog_id in ids[:10]:
    print(imagenet_tool.id_to_words(dog_id))
print('...')


dalmatian, coach dog, carriage dog
Mexican hairless
Newfoundland, Newfoundland dog
basenji
Leonberg
pug, pug-dog
Great Pyrenees
Rhodesian ridgeback
vizsla, Hungarian pointer
German short-haired pointer
...

Unsupervised heatmap of the class "dog"

The following function builds a heatmap from a forward pass. It sums the probability maps of all the class ids belonging to a given synset:


In [25]:
def build_heatmap(prediction_map, synset):
    class_ids = imagenet_tool.synset_to_dfs_ids(synset)
    class_ids = np.array([id_ for id_ in class_ids if id_ is not None])
    each_dog_proba_map = prediction_map[0, :, :, class_ids]
    # this style of indexing a tensor with another array has the following shape effect:
    # (H, W, 1000) indexed by (118) ==> (118, H, W)
    any_dog_proba_map = each_dog_proba_map.sum(axis=0)
    print("size of heatmap: " + str(any_dog_proba_map.shape))
    return any_dog_proba_map
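
The indexing trick used above can be surprising; here is a tiny standalone illustration of how mixing a scalar index, slices and an index array moves the indexed dimensions to the front (shapes chosen arbitrarily):

demo = np.random.randn(1, 7, 10, 1000)
some_ids = np.array([3, 42, 999])
print(demo[0, :, :, some_ids].shape)  # (3, 7, 10): the selected class axis comes first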

In [26]:
def display_img_and_heatmap(img_path, heatmap):
    dog = imread(img_path)
    plt.figure(figsize=(12, 8))
    plt.subplot(1, 2, 1)
    plt.imshow(dog)
    plt.axis('off')
    plt.subplot(1, 2, 2)
    plt.imshow(heatmap, interpolation='nearest', cmap="viridis")
    plt.axis('off')

Exercise

  • What is the size of the heatmap compared to the input image?
  • Build 3 or 4 dog heatmaps from "dog.jpg", with the following sizes:
    • (200, 320)
    • (400, 640)
    • (800, 1280)
    • (1600, 2560) (optional, requires a lot of memory)
  • What do you observe?

You may plot a heatmap using the function display_img_and_heatmap defined above. You might also want to reuse forward_pass_resize to compute the class maps themselves.


In [27]:
# dog synset
s = "n02084071"

# TODO

In [30]:
# %load solutions/build_heatmaps.py
s = "n02084071"

probas_1 = forward_pass_resize("dog.jpg", (200, 320))
heatmap_1 = build_heatmap(probas_1, synset=s)
display_img_and_heatmap("dog.jpg", heatmap_1)

probas_2 = forward_pass_resize("dog.jpg", (400, 640))
heatmap_2 = build_heatmap(probas_2, synset=s)
display_img_and_heatmap("dog.jpg", heatmap_2)

probas_3 = forward_pass_resize("dog.jpg", (800, 1280))
heatmap_3 = build_heatmap(probas_3, synset=s)
display_img_and_heatmap("dog.jpg", heatmap_3)

# We observe that heatmap_1 and heatmap_2 gave coarser 
# segmentations than heatmap_3. However, heatmap_3
# has small artifacts outside of the dog area
# heatmap_3 encodes more local, texture level information
# about the dog, while lower resolutions will encode more
# semantic information about the full object
# combining them is probably a good idea!


Image shape before resizing: (1600, 2560, 3)
Image batch size shape before forward pass: (1, 200, 320, 3)
size of heatmap: (7, 10)
Image shape before resizing: (1600, 2560, 3)
Image batch size shape before forward pass: (1, 400, 640, 3)
size of heatmap: (13, 20)
Image shape before resizing: (1600, 2560, 3)
Image batch size shape before forward pass: (1, 800, 1280, 3)
size of heatmap: (25, 40)

Combining the 3 heatmaps

By combining the heatmaps at different scales, we obtain much better information about the location of the dog.

Bonus

  • Combine the three heatmaps by resizing them to a common shape and averaging them
  • A geometric mean will work better than an arithmetic average!

In [31]:
from skimage.transform import resize

# TODO

In [32]:
# %load solutions/geom_avg.py
from skimage.transform import resize

heatmap_1_r = resize(heatmap_1, (50,80), mode='reflect',
                     preserve_range=True, anti_aliasing=True)
heatmap_2_r = resize(heatmap_2, (50,80), mode='reflect',
                     preserve_range=True, anti_aliasing=True)
heatmap_3_r = resize(heatmap_3, (50,80), mode='reflect',
                     preserve_range=True, anti_aliasing=True)


heatmap_geom_avg = np.power(heatmap_1_r * heatmap_2_r * heatmap_3_r, 0.333)
display_img_and_heatmap("dog.jpg", heatmap_geom_avg)


Bonus

Experiment with semantic segmentation. You may train on the COCO dataset: http://mscoco.org/dataset/#overview

  • Use the GPU to precompute the activations of a headless and convolutionalized ResNet50 or Xception model;
  • Initialize the weights of a new Convolution2D(n_classes, 1, 1) layer at random (a minimal sketch of such a head is given after this list);
  • Train the top of the segmentation model on class label data extracted from the MS COCO 2016 dataset;
  • Start with a single low resolution model. Then add multi-scale and see the improvement.
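
A minimal sketch of the trainable segmentation head described above, assuming feature maps precomputed by the headless ResNet50 (shape (n_samples, h, w, 2048)) and matching low-resolution label maps; the names precomputed_features and low_res_label_maps and the value of n_classes are hypothetical:

from tensorflow.keras.layers import Input, Convolution2D
from tensorflow.keras.models import Model

n_classes = 81  # hypothetical: e.g. MS COCO categories plus background

feat_input = Input(shape=(None, None, 2048))            # precomputed ResNet50 feature maps
logits = Convolution2D(n_classes, (1, 1))(feat_input)   # per-location class scores
seg_head = Model(inputs=feat_input, outputs=logits)

seg_head.compile(optimizer="adam",
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# seg_head.fit(precomputed_features, low_res_label_maps, epochs=..., batch_size=...)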

To go further, consider open source implementations of models rather than building your own from scratch. For instance, FAIR's detection library Detectron (in Caffe2) provides many state-of-the-art models: https://github.com/facebookresearch/Detectron


In [ ]: