In [ ]:
%matplotlib inline
import importlib
import utils2; importlib.reload(utils2)
from utils2 import *

In [ ]:
limit_mem()

In [ ]:
from keras.datasets.cifar10 import load_batch

This notebook contains a Keras implementation of Huang et al.'s DenseNet.

Our motivation for studying DenseNet is how well it works with limited data.

DenseNet beats the previous state of the art on CIFAR-10/CIFAR-100, both with and without data augmentation, and the improvement is most pronounced without data augmentation.

Compared to FractalNet, previously state-of-the-art on both datasets:

  • CIFAR-10: ~ 30 % relative improvement w/ DenseNet
  • CIFAR-100: ~ 30 % relative improvement w/ DenseNet

That increase is motivation enough.

So what is a DenseNet?

Put simply, a DenseNet is a ResNet where we replace addition with concatenation.

Idea

Recall that, in broad terms, a ResNet is a convnet that uses residual block structures.

These "blocks" work as follows:

  • Let Lt be the input layer to block
  • Perform conv layer transformations/activations on Lt, denote by f(t)
  • Call output layer of block Lt+1
  • Define Lt+1 = f(Lt)+ Lt
    • That is, total output is the conv layer outputs plus the original input
  • We call residual block b.c. f(Lt)=Lt+1 - Lt, the residual

As mentioned, the difference with DenseNet is that instead of adding $f(L_t)$ to $L_t$, we concatenate them.
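
For concreteness, here is a minimal, standalone sketch contrasting the two merges with Keras 1's functional API (the tensors x and f_x here are hypothetical and not part of the model built below):

In [ ]:
x = Input(shape=(32, 32, 16))
f_x = Convolution2D(16, 3, 3, border_mode='same')(x)        # some conv transformation f
res_out = merge([x, f_x], mode='sum')                       # ResNet: add; channels stay at 16
dense_out = merge([x, f_x], mode='concat', concat_axis=-1)  # DenseNet: concat; channels grow to 32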

As with ResNet, a DenseNet consists of multiple blocks, so there is a recursive relationship across blocks:

  • Block $B_i$ takes as input the output of block $B_{i-1}$ concatenated with the input of $B_{i-1}$
  • The input to $B_{i-1}$ is the output of block $B_{i-2}$ concatenated with the input of $B_{i-2}$
  • And so on

The number of filters added at each layer needs to be kept in check, given that the number of input channels to each block keeps growing.

Huang et al. call the number of filters added at each layer the growth rate, and denote it by $k$.
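
As a quick sanity check of how channels accumulate (the numbers are illustrative, matching the defaults used later: 16 initial filters, $k = 12$, 12 conv layers per block):

In [ ]:
nb_filter, growth_rate, nb_layers = 16, 12, 12
channels = nb_filter
for layer in range(nb_layers):
    channels += growth_rate      # each conv layer appends k feature maps
print(channels)                  # 16 + 12*12 = 160 channels entering the next block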

DenseNet / CIFAR-10

Let's load the data.


In [ ]:
def load_data():
    # Load the five CIFAR-10 training batches (10,000 images each), plus the test batch
    path = 'data/cifar-10-batches-py'
    num_train_samples = 50000
    x_train = np.zeros((num_train_samples, 3, 32, 32), dtype='uint8')
    y_train = np.zeros((num_train_samples,), dtype='uint8')
    for i in range(1, 6):
        data, labels = load_batch(os.path.join(path, 'data_batch_' + str(i)))
        x_train[(i - 1) * 10000: i * 10000, :, :, :] = data
        y_train[(i - 1) * 10000: i * 10000] = labels
    x_test, y_test = load_batch(os.path.join(path, 'test_batch'))
    y_train = np.reshape(y_train, (len(y_train), 1))
    y_test = np.reshape(y_test, (len(y_test), 1))
    # Move channels last: (N, 3, 32, 32) -> (N, 32, 32, 3)
    x_train = x_train.transpose(0, 2, 3, 1)
    x_test = x_test.transpose(0, 2, 3, 1)
    return (x_train, y_train), (x_test, y_test)

In [ ]:
(x_train, y_train), (x_test, y_test) = load_data()

Here's an example image from CIFAR-10:


In [ ]:
plt.imshow(x_train[1])

We normalize the pixel values from [0, 255] to the unit interval.


In [ ]:
x_train = x_train/255.
x_test = x_test/255.

DenseNet

The pieces

Let's make some helper functions for piecing together our network using Keras' Functional API.

These components should all be familiar to you:

  • ReLU activation
  • Dropout regularization
  • Batch normalization

In [ ]:
def relu(x): return Activation('relu')(x)
def dropout(x, p): return Dropout(p)(x) if p else x
def bn(x): return BatchNormalization(mode=0, axis=-1)(x)
def relu_bn(x): return relu(bn(x))

Convolutional layer:

  • L2 regularization on the weights
  • 'same' border mode keeps the width/height unchanged
  • Output is passed through Dropout

In [ ]:
def conv(x, nf, sz, wd, p):
    x = Convolution2D(nf, sz, sz, init='he_uniform', border_mode='same', 
                          W_regularizer=l2(wd))(x)
    return dropout(x,p)

Define ConvBlock as sequence:

  • Batchnorm
  • ReLU Activation
  • Conv layer (conv w/ Dropout)

The authors also use something called a bottleneck layer to reduce the dimensionality of inputs.

Recall that the number of feature maps grows with each layer. The input depth determines the size of each convolution's weight matrix, i.e. the # of parameters.

At sizes 3x3 or larger, convolutions can become extremely costly, and the # of parameters increases quickly as a function of the input feature (filter) space. A smart approach is therefore to reduce the dimensionality of the filters with a 1x1 convolution, using a smaller # of filters, before the larger convolution; the cost comparison below illustrates the savings.
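
Here's an illustrative parameter count (the channel numbers are hypothetical, chosen for a point deep in the network where inputs are wide; biases and BN parameters are ignored):

In [ ]:
c_in, k = 256, 12
direct = c_in * k * 3 * 3                                # 3x3 straight from 256 channels: 27,648 weights
bottlenecked = c_in * (4*k) * 1 * 1 + (4*k) * k * 3 * 3  # 1x1 down to 4k=48, then 3x3: 12,288 + 5,184 = 17,472
print(direct, bottlenecked)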

The bottleneck consists of:

  • A 1x1 conv
  • Compressing the # of filters to 4 × the growth rate (nf * 4)
  • Followed by Batchnorm -> ReLU

In [ ]:
def conv_block(x, nf, bottleneck=False, p=None, wd=0):
    x = relu_bn(x)
    # Optional bottleneck: 1x1 conv down to 4*nf filters before the 3x3 conv
    if bottleneck: x = relu_bn(conv(x, nf * 4, 1, wd, p))
    return conv(x, nf, 3, wd, p)

Now we can define the dense block:

  • Take the given input x
  • Pass it through a conv block for output b
  • Concatenate the input x and conv block output b
  • Set the concatenation as the new input x for the next conv block
  • Repeat nb_layers times

In [ ]:
def dense_block(x, nb_layers, growth_rate, bottleneck=False, p=None, wd=0):
    # A bottlenecked conv block contains two convs, so halve the count to keep total depth the same
    if bottleneck: nb_layers //= 2
    for i in range(nb_layers):
        b = conv_block(x, growth_rate, bottleneck=bottleneck, p=p, wd=wd)
        x = merge([x,b], mode='concat', concat_axis=-1)
    return x

As is typical for CV architectures, we'll do some pooling after the computation.

We'll define this unit as the transition block, and we'll put one between each pair of dense blocks.

Aside from BN -> ReLU and average pooling, this block also has an option for filter compression. This is simply feature reduction via a 1x1 conv, as discussed before, where the new # of filters is a percentage of the incoming # of filters.

Together with bottleneck, compression has been shown to improve the performance and computational efficiency of DenseNet architectures. (The authors call this combination DenseNet-BC.)


In [ ]:
def transition_block(x, compression=1.0, p=None, wd=0):
    # compression < 1 shrinks the # of filters coming out of the 1x1 conv
    nf = int(x.get_shape().as_list()[-1] * compression)
    x = relu_bn(x)
    x = conv(x, nf, 1, wd, p)
    return AveragePooling2D((2, 2), strides=(2, 2))(x)

Build the DenseNet model

We've now defined all the building blocks (literally) to put together a DenseNet.

  • nb_classes: number of classes
  • img_input: input Keras tensor, e.g. from Input(shape=(rows, columns, channels))
  • depth: total number of layers
    • Includes 4 extra non-block layers
      • 1 input conv, 3 output layers
  • nb_block: number of dense blocks (generally = 3)
    • NOTE: layers are allocated evenly across blocks, so nb_block must evenly divide (depth - 4)
  • growth_rate: number of filters added per conv block
  • nb_filter: initial number of filters
  • bottleneck: whether to add bottleneck blocks
  • compression: filter compression factor in transition blocks
  • p: dropout rate
  • wd: weight decay
  • activation: type of activation at the top layer; one of 'softmax' or 'sigmoid'. Note that if 'sigmoid' is used, nb_classes must be 1.

Returns: a Keras tensor with the network's predictions

From start to finish, this generates:

  • Conv input layer
  • Alternate between Dense/Transition blocks nb_block times, omitting the Transition block after the last Dense block
    • Each Dense block has (depth - 4) / nb_block layers
  • Pass the final Dense block output to BN -> ReLU
  • Global average pooling
  • Dense layer w/ the desired output activation
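
Before building it, here's a quick trace of the layer and channel bookkeeping for the configuration we'll actually train below (depth=100, nb_block=3, growth_rate=12, nb_filter=16, bottleneck=True, compression=0.5); a sanity check, not part of the model:

In [ ]:
depth, nb_block, k, compression = 100, 3, 12, 0.5
layers_per_block = (depth - 4) // nb_block   # 32; halved to 16 conv blocks by bottleneck
channels = 16                                # initial nb_filter
for i in range(nb_block):
    channels += (layers_per_block // 2) * k  # each conv block appends k feature maps
    if i != nb_block - 1:
        channels = int(channels * compression)  # transition block compresses
    print('after block %d: %d channels' % (i + 1, channels))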

In [ ]:
def create_dense_net(nb_classes, img_input, depth=40, nb_block=3, 
     growth_rate=12, nb_filter=16, bottleneck=False, compression=1.0, p=None, wd=0, activation='softmax'):
    
    assert activation == 'softmax' or activation == 'sigmoid'
    assert (depth - 4) % nb_block == 0
    # Allocate the (depth - 4) conv layers evenly across the dense blocks
    nb_layers_per_block = int((depth - 4) / nb_block)
    nb_layers = [nb_layers_per_block] * nb_block

    # Input conv layer
    x = conv(img_input, nb_filter, 3, wd, 0)
    # Dense blocks, with a transition block between each pair
    for i,block in enumerate(nb_layers):
        x = dense_block(x, block, growth_rate, bottleneck=bottleneck, p=p, wd=wd)
        if i != len(nb_layers)-1:
            x = transition_block(x, compression=compression, p=p, wd=wd)

    # Output layers: BN -> ReLU, global average pooling, dense classifier
    x = relu_bn(x)
    x = GlobalAveragePooling2D()(x)
    return Dense(nb_classes, activation=activation, W_regularizer=l2(wd))(x)

Train

Now we can test it out on CIFAR-10.


In [ ]:
input_shape = (32,32,3)

In [ ]:
img_input = Input(shape=input_shape)

In [ ]:
x = create_dense_net(10, img_input, depth=100, nb_filter=16, compression=0.5, 
                     bottleneck=True, p=0.2, wd=1e-4)

In [ ]:
model = Model(img_input, x)

In [ ]:
model.compile(loss='sparse_categorical_crossentropy', 
      optimizer=keras.optimizers.SGD(0.1, 0.9, nesterov=True), metrics=["accuracy"])

In [ ]:
parms = {'verbose': 2, 'callbacks': [TQDMNotebookCallback()]}

In [ ]:
K.set_value(model.optimizer.lr, 0.1)

This will likely need to run overnight, along with some manual learning-rate annealing (below we lower, and occasionally restore, the learning rate between fits)...
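
As an aside, the manual annealing below could also be expressed as a scheduler callback. Here's a sketch (not what was actually run; the epoch thresholds are illustrative, loosely following the paper's divide-by-10-at-50%-and-75% recipe):

In [ ]:
from keras.callbacks import LearningRateScheduler

def schedule(epoch, nb_epoch=130):
    if epoch < nb_epoch * 0.5:  return 0.1
    if epoch < nb_epoch * 0.75: return 0.01
    return 0.001

# e.g.: parms = {'verbose': 2, 'callbacks': [TQDMNotebookCallback(), LearningRateScheduler(schedule)]}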


In [ ]:
model.fit(x_train, y_train, 64, 20, validation_data=(x_test, y_test), **parms)

In [ ]:
K.set_value(model.optimizer.lr, 0.01)

In [ ]:
model.fit(x_train, y_train, 64, 4, validation_data=(x_test, y_test), **parms)

In [ ]:
K.set_value(model.optimizer.lr, 0.1)

In [ ]:
model.fit(x_train, y_train, 64, 20, validation_data=(x_test, y_test), **parms)

In [ ]:
K.set_value(model.optimizer.lr, 0.01)

In [ ]:
model.fit(x_train, y_train, 64, 40, validation_data=(x_test, y_test), **parms)

In [ ]:
K.set_value(model.optimizer.lr, 0.001)

In [ ]:
model.fit(x_train, y_train, 64, 20, validation_data=(x_test, y_test), **parms)

In [ ]:
K.set_value(model.optimizer.lr, 0.01)

In [ ]:
model.fit(x_train, y_train, 64, 10, validation_data=(x_test, y_test), **parms)

In [ ]:
K.set_value(model.optimizer.lr, 0.001)

In [ ]:
model.fit(x_train, y_train, 64, 20, validation_data=(x_test, y_test), **parms)

And we're able to replicate their state-of-the-art results!


In [ ]:
%time model.save_weights('models/93.h5')

End


In [ ]: