There are several well-known named architectures in the field of Convolutional Networks. The most common are LeNet, AlexNet, ZF Net, GoogLeNet, VGGNet, and ResNet.
In [7]:
# Small VGG-like convnet in Keras
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD

# Generate dummy data
def to_categorical(y, num_classes=None):
    """Converts a class vector (integers) to a binary class matrix."""
    y = np.array(y, dtype='int').ravel()
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    categorical[np.arange(n), y] = 1
    return categorical

x_train = np.random.random((100, 100, 100, 3))
y_train = to_categorical(np.random.randint(10, size=(100, 1)), num_classes=10)
x_test = np.random.random((20, 100, 100, 3))
y_test = to_categorical(np.random.randint(10, size=(20, 1)), num_classes=10)

model = Sequential()
# input: 100x100 images with 3 channels -> (100, 100, 3) tensors.
# this applies 32 convolution filters of size 3x3 each.
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 3)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

model.summary()

model.fit(x_train, y_train, batch_size=32, epochs=10)
score = model.evaluate(x_test, y_test, batch_size=32)
Exercise
Why do we have 896 parameters in the first Conv2D layer of the previous example?
Compute the number of parameters of the original VGG16 (all CONV layers are 3x3).
The VGG16 architecture is: INPUT: [224x224x3] $\rightarrow$ CONV3-64: [224x224x64] $\rightarrow$ CONV3-64: [224x224x64] $\rightarrow$ POOL2: [112x112x64] $\rightarrow$ CONV3-128: [112x112x128] $\rightarrow$ CONV3-128: [112x112x128] $\rightarrow$ POOL2: [56x56x128] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ POOL2: [28x28x256] $\rightarrow$ CONV3-512: [28x28x512] $\rightarrow$ CONV3-512: [28x28x512] $\rightarrow$ CONV3-512: [28x28x512] $\rightarrow$ POOL2: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512] $\rightarrow$ POOL2: [7x7x512] $\rightarrow$ FC: [1x1x4096] $\rightarrow$ FC: [1x1x4096] $\rightarrow$ FC: [1x1x1000].
The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. What is the necessary memory size (supposing that we need 4 bytes for each element) to store intermediate data?
In [ ]:
# your code here
Blue Box: Convolution | Red Box: Pooling | Yellow Box: Softmax | Green Box: Normalization
What is the role of 1x1 convolutions?
When it comes to neural network design, the trend in the past few years has pointed in one direction: deeper.
Whereas the state of the art only a few years ago consisted of networks which were roughly twelve layers deep, it is now not surprising to come across networks which are hundreds of layers deep.
This move hasn't just consisted of greater depth for depth's sake. For many applications, the most prominent being object classification, the deeper the neural network, the better the performance.
The problem, then, is to design a network in which the gradient can easily reach all of its layers, even when the network is dozens or hundreds of layers deep. This is the goal behind some of the state-of-the-art architectures: ResNets, HighwayNets, and DenseNets.
HighwayNets builds on the ResNet in a pretty intuitive way. The Highway Network preserves the shortcuts introduced in the ResNet, but augments them with a learnable parameter to determine to what extent each layer should be a skip connection or a nonlinear connection. Layers in a Highway Network are defined as follows:
$$ y = H(x, W_H) \cdot T(x,W_T) + x \cdot C(x, W_C) $$
In this equation we can recognize the two kinds of layers discussed so far: $y = H(x,W_H)$ mirrors the traditional layer, and $y = H(x,W_H) + x$ mirrors a residual unit.
The traditional layer can be implemented as:
import tensorflow as tf

def dense(x, input_size, output_size, activation):
    # a plain fully connected layer: y = activation(Wx + b)
    W = tf.Variable(tf.truncated_normal([input_size, output_size], stddev=0.1), name="weight")
    b = tf.Variable(tf.constant(0.1, shape=[output_size]), name="bias")
    y = activation(tf.matmul(x, W) + b)
    return y
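For comparison (not shown in the original), a residual unit of the form $y = H(x,W_H) + x$ can be sketched in the same style, assuming the layer preserves the input width so the identity shortcut can simply be added:

def residual(x, size, activation):
    # same dense transform as above, plus an identity shortcut: y = H(x, W_H) + x
    W = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight")
    b = tf.Variable(tf.constant(0.1, shape=[size]), name="bias")
    return activation(tf.matmul(x, W) + b) + x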
What is new is $T(x,W_T)$, the transform gate function, and $C(x,W_C) = 1 - T(x,W_T)$, the carry gate function. When the transform gate is 1, we pass through the activation $H$ and suppress the carry gate (since it will be 0). When the carry gate is 1, we pass through the unmodified input $x$, while the activation is suppressed.
def highway(x, size, activation, carry_bias=-1.0):
    # transform gate parameters (bias initialized negative so the network starts by carrying x)
    W_T = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight_transform")
    b_T = tf.Variable(tf.constant(carry_bias, shape=[size]), name="bias_transform")
    # parameters of the usual affine transform H
    W = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight")
    b = tf.Variable(tf.constant(0.1, shape=[size]), name="bias")
    T = tf.sigmoid(tf.matmul(x, W_T) + b_T, name="transform_gate")
    H = activation(tf.matmul(x, W) + b, name="activation")
    C = tf.subtract(1.0, T, name="carry_gate")  # C = 1 - T
    y = tf.add(tf.multiply(H, T), tf.multiply(x, C), name="y")
    return y
With this kind of network you can train models with hundreds of layers.
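A minimal usage sketch (TensorFlow 1.x style, matching the code above; the input width of 50 and the depth of 20 layers are arbitrary illustrative choices):

x = tf.placeholder(tf.float32, [None, 50])   # hypothetical 50-dimensional input
h = x
for _ in range(20):                          # stack e.g. 20 highway layers of the same width
    h = highway(h, 50, tf.nn.relu)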
DenseNet takes the insights of the skip connection to the extreme. The idea here is that if connecting a skip connection from the previous layer improves performance, why not connect every layer to every other layer? That way there is always a direct route for the information backwards through the network.
Instead of using addition, however, DenseNet relies on stacking (concatenating) the outputs of previous layers. Mathematically this looks like:
$$ y = f(x_n, x_{n-1}, x_{n-2}, \dots, x_0) $$
This architecture makes intuitive sense in both the forward and backward passes. In the forward pass, a task may benefit from having access to low-level feature activations in addition to high-level ones. In classifying objects, for example, a lower layer of the network may detect edges in an image, whereas a higher layer detects larger-scale features such as the presence of faces. There may be cases where being able to use information about edges helps in determining the correct object in a complex scene. In the backward pass, having all the layers connected lets gradients flow directly to every part of the network.
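A minimal sketch of the idea, using fully connected layers instead of convolutions to match the style of the highway example above (the `growth_rate` parameter and layer sizes are illustrative assumptions, not the official DenseNet implementation; `x` is a 2-D `(batch, input_size)` tensor):

def dense_block(x, num_layers, input_size, growth_rate, activation=tf.nn.relu):
    # each new layer sees the concatenation of the block input and all previous outputs
    features = [x]
    width = input_size
    for _ in range(num_layers):
        inp = tf.concat(features, axis=-1)
        W = tf.Variable(tf.truncated_normal([width, growth_rate], stddev=0.1))
        b = tf.Variable(tf.constant(0.1, shape=[growth_rate]))
        h = activation(tf.matmul(inp, W) + b)
        features.append(h)
        width += growth_rate  # the concatenated input grows by growth_rate each layer
    return tf.concat(features, axis=-1)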
(Source: http://cs231n.github.io/convolutional-networks/#convert)
The only difference between Fully Connected (FC) and Convolutional (CONV) layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.
However, the neurons in both layers still compute dot products, so their functional form is identical.
Then, it is easy to see that for any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
Conversely, any FC layer can be converted to a CONV layer.
Let $F$ be the receptive field size of the CONV layer neurons, $S$ the stride with which they are applied, $P$ the amount of zero padding used on the border, and $K$ the depth (number of filters) of the CONV layer.
For example, an FC layer with $K=4096$ that is looking at some input volume of size $7×7×512$ can be equivalently expressed as a CONV layer with $F=7,P=0,S=1,K=4096$.
In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096 since only a single depth column “fits” across the input volume, giving identical result as the initial FC layer.
This can be very useful!
Consider a ConvNet architecture that takes a 224x224x3 image and uses a series of CONV and POOL layers to reduce it to an activation volume of size 7x7x512. From there, it uses two FC layers of size 4096 and a final FC layer with 1000 neurons that computes the class scores. We can convert each of these three FC layers to CONV layers as described above:
Replace the first FC layer that looks at the [7x7x512] volume with a CONV layer that uses filter size F=7, giving an output volume of [1x1x4096].
Replace the second FC layer with a CONV layer that uses filter size F=1, giving an output volume of [1x1x4096].
Replace the last FC layer similarly, with F=1, giving the final output of [1x1x1000].
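A minimal Keras sketch of this conversion (layer sizes taken from the example above; an illustrative sketch of the idea, not the actual VGG weights; the spatial dimensions are left as None so the same model can slide over larger inputs):

from keras.models import Sequential
from keras.layers import Conv2D

fc_as_conv = Sequential()
# FC with 4096 units over a 7x7x512 volume == CONV with F=7, K=4096
fc_as_conv.add(Conv2D(4096, (7, 7), activation='relu', input_shape=(None, None, 512)))
# the remaining FC layers become 1x1 convolutions
fc_as_conv.add(Conv2D(4096, (1, 1), activation='relu'))
fc_as_conv.add(Conv2D(1000, (1, 1), activation='softmax'))

Fed a 7x7x512 volume, this produces a 1x1x1000 output; fed the 12x12x512 volume obtained from a 384x384 image, it produces the 6x6x1000 array of class scores described below.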
It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
For example, if a $224x224$ image gives a volume of size $[7x7x512]$ - i.e. a reduction by 32 - then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume of size $[12x12x512]$, since $384/32 = 12$. Following through with the next 3 CONV layers that we just converted from FC layers would now give a final volume of size $[6x6x1000]$, since $(12 - 7)/1 + 1 = 6$. Note that instead of a single vector of class scores of size $[1x1x1000]$, we're now getting an entire $6x6$ array of class scores across the $384x384$ image.
Evaluating the original ConvNet (with FC layers) independently across $224x224$ crops of the $384x384$ image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time. Forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation.
In classification, there's generally an image with a single object as the focus, and the task is to say what that object is. But when we look at the world around us, we see complicated scenes with multiple overlapping objects and different backgrounds, and we not only classify these different objects but also identify their boundaries, differences, and relations to one another!
To what extent do CNNs generalize to object detection? Object detection is the task of finding the different objects in an image and classifying them.
A team composed of Ross Girshick (a name we'll see again), Jeff Donahue, and Trevor Darrell found that this problem can be solved with AlexNet, testing on the PASCAL VOC Challenge, a popular object detection challenge akin to ImageNet.
The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are (via bounding boxes).
Inputs: Image
Outputs: Bounding boxes + labels for each object in the image.
But how do we find out where these bounding boxes are? R-CNN proposes a bunch of boxes in the image and checks whether any of them actually correspond to an object.
R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search (see http://www.cs.cornell.edu/courses/cs7670/2014sp/slides/VisionSeminar14.pdf).
At a high level, Selective Search (shown in the image above) looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects.
Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet.
On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object.
Now, having found the object in the box, can we tighten the box to fit the true dimensions of the object? We can, and this is the final step of R-CNN. R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates to get our final result. Here are the inputs and outputs of this regression model:
Inputs: sub-regions of the image corresponding to objects.
Outputs: New bounding box coordinates for the object in the sub-region.
R-CNN works really well, but it is quite slow for a few simple reasons: it requires a forward pass of the CNN for every single region proposal for every single image (around 2000 forward passes per image!), and it has to train three different models separately (the CNN that generates image features, the SVM classifier, and the regression model that tightens the bounding boxes).
In 2015, Ross Girshick, the first author of R-CNN, solved both these problems, leading to Fast R-CNN.
For the forward pass of the CNN, Girshick realized that for each image, a lot of the proposed regions invariably overlapped, causing us to run the same CNN computation again and again (~2000 times!). His insight was simple: why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?
This is exactly what Fast R-CNN does using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. In the image below, notice how the CNN features for each region are obtained by selecting a corresponding region from the CNN’s feature map. Then, the features in each region are pooled (usually using max pooling). So all it takes us is one pass of the original image as opposed to ~2000!
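A minimal NumPy sketch of the RoI max-pooling idea (illustrative only, not the actual Fast R-CNN implementation; the region coordinates and feature-map sizes below are made up): crop a proposal's region from the shared feature map and max-pool it to a fixed spatial size so it can be fed to the fully connected layers.

import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """feature_map: (H, W, C) array; roi: (y0, x0, y1, x1) in feature-map coordinates."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    out_h, out_w = output_size
    # split the region into an out_h x out_w grid of roughly equal cells
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            cell = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1], :]
            if cell.size:
                pooled[i, j, :] = cell.max(axis=(0, 1))  # max pooling within each grid cell
    return pooled

# e.g. pool a hypothetical 5x7 proposal from a 14x14x512 feature map down to a fixed 7x7x512
features = np.random.random((14, 14, 512))
fixed = roi_max_pool(features, (2, 4, 7, 11))   # -> shape (7, 7, 512)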
The second insight of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Where earlier we had different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor), Fast R-CNN instead used a single network to compute all three.
Even with all these advancements, there was still one remaining bottleneck in the Fast R-CNN process — the region proposer. As we saw, the very first step to detecting the locations of objects is generating a bunch of potential bounding boxes or regions of interest to test. In Fast R-CNN, these proposals were created using Selective Search, a fairly slow process that was found to be the bottleneck of the overall process.
In mid-2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost-free through an architecture they (creatively) named Faster R-CNN.
The insight of Faster R-CNN was that region proposals depended on features of the image that were already calculated with the forward pass of the CNN (first step of classification). So why not reuse those same CNN results for region proposals instead of running a separate selective search algorithm?
Here are the inputs and outputs of their model:
Inputs: Images (Notice how region proposals are not needed).
Outputs: Classifications and bounding box coordinates of objects in the images.
So far, we’ve seen how we’ve been able to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes.
Can we extend such techniques to go one step further and locate exact pixels of each object instead of just bounding boxes? This problem, known as image segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.
Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel level segmentation?
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch (in white in the above image), as before, is just a Fully Convolutional Network on top of a CNN based feature map. Here are its inputs and outputs:
Inputs: CNN feature map.
Outputs: A matrix with 1s in all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).
IMDB Movie reviews sentiment classification: a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
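A short example of that filtering, using the num_words and skip_top arguments of the Keras loader keras.datasets.imdb.load_data:

from keras.datasets import imdb

# keep only the 10,000 most frequent words, dropping the 20 most frequent ones
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000, skip_top=20)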
The seminal research paper on this subject was published by Yoon Kim in 2014. In this paper, Kim laid the foundations for modeling and processing text with convolutional neural networks for the purpose of sentiment analysis. He showed that with simple one-dimensional convolutional networks, one can build very simple models that quickly reach around 90% accuracy.
Here is the text of an example review from our dataset:
In [1]:
'''
This example demonstrates the use of Conv1D for text classification.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb

# set parameters:
max_features = 5000
maxlen = 100
batch_size = 32
embedding_dims = 100
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 10

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(Dropout(0.25))

# we add a Conv1D, which will learn `filters` word-group filters
# of size `kernel_size`:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use standard max pooling (halving the output of the previous layer):
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=2))

# We flatten the output of the conv layer,
# so that we can add a vanilla dense layer:
model.add(Flatten())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.25))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test))
Out[1]: