In this notebook we will play with a Feed-Forward FC-NN (Fully Connected Neural Network) for a classification task:
Image Classification on MNIST Dataset
RECALL
In the FC-NN, the output of each layer is computed using the activations from the previous one, as follows:
$$h_{i} = \sigma(W_i h_{i-1} + b_i)$$where ${h}_i$ is the activation vector from the $i$-th layer (or the input data for $i=0$), ${W}_i$ and ${b}_i$ are the weight matrix and the bias vector for the $i$-th layer, respectively.
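As a quick illustration, here is a minimal NumPy sketch of this forward pass for a single layer (the layer sizes and the logistic activation are placeholders, not the architecture we will build later):
In [ ]:
import numpy as np

def sigma(x):
    # example activation: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

h_prev = np.random.rand(784)          # h_{i-1}: activations from the previous layer (or the input)
W = np.random.randn(512, 784) * 0.01  # W_i: weight matrix of the i-th layer
b = np.zeros(512)                     # b_i: bias vector of the i-th layer

h = sigma(W @ h_prev + b)             # h_i = sigma(W_i h_{i-1} + b_i)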
To regularize the model, we will also insert a Dropout layer between consecutive hidden layers.
Dropout works by “dropping out” some unit activations in a given layer, that is, setting them to zero with a given probability.
Our loss function will be the categorical crossentropy.
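As a reminder, for a single example with one-hot target $y$ and predicted probabilities $p$, the categorical crossentropy is $-\sum_k y_k \log p_k$, i.e. the negative log-probability assigned to the true class. A tiny NumPy sketch (the values are made up):
In [ ]:
import numpy as np

y = np.array([0, 0, 1, 0])          # one-hot target: true class is 2
p = np.array([0.1, 0.2, 0.6, 0.1])  # predicted class probabilities (sum to 1)

loss = -np.sum(y * np.log(p))       # equals -log(p[2]) ~ 0.51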
Keras supports two different kinds of models: the Sequential model and the functional Model. The former is used to build linear stacks of layers (so each layer has one input and one output), and the latter supports any kind of connection graph.
In our case we build a Sequential model with three Dense (aka fully connected) layers, with some Dropout. Notice that the output layer has the softmax activation function.
The resulting model is actually a function of its own inputs, implemented using the Keras backend.
We apply the categorical crossentropy loss and choose SGD as the optimizer.
Keep in mind that Keras supports a variety of different optimizers and loss functions, which you may want to check out.
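For comparison, here is a hedged sketch of the same kind of stack written with the functional Model API (assuming Keras 2; we will build the actual model with Sequential below):
In [ ]:
from keras.models import Model
from keras.layers import Input, Dense

# same architecture as the Sequential model below, expressed as a graph of layer calls
inputs = Input(shape=(784,))
x = Dense(512, activation='relu')(inputs)
x = Dense(512, activation='relu')(x)
outputs = Dense(10, activation='softmax')(x)

functional_model = Model(inputs=inputs, outputs=outputs)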
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
The ReLU function is defined as $f(x) = \max(0, x)$ [1].
A smooth approximation to the rectifier is the analytic function $f(x) = \ln(1 + e^x)$,
which is called the softplus function.
The derivative of softplus is $f'(x) = e^x / (e^x + 1) = 1 / (1 + e^{-x})$, i.e. the logistic function.
[1] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines”, ICML 2010. http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf
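A quick visual check of the two functions, reusing the NumPy and matplotlib imports from the cell above (just a plotting sketch):
In [ ]:
x = np.linspace(-5, 5, 200)
relu = np.maximum(0, x)            # f(x) = max(0, x)
softplus = np.log(1 + np.exp(x))   # f(x) = ln(1 + e^x)

plt.plot(x, relu, label='ReLU')
plt.plot(x, softplus, label='softplus')
plt.legend()
plt.show()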
In [2]:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
nb_classes = 10
# FC@512+relu -> FC@512+relu -> FC@nb_classes+softmax
# ... your Code Here
In [3]:
# %load ../solutions/sol_321.py
In [4]:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(512, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.001),
metrics=['accuracy'])
We will train our model on the MNIST dataset, which consists of 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.
Since this dataset is provided with Keras, we just ask the `keras.datasets` module for the training and test data.
The `categorical_crossentropy` loss expects one-hot vectors as targets, therefore we apply the `to_categorical` function from `keras.utils` to convert the integer labels to one-hot vectors.
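For instance, the integer label 3 becomes a length-10 one-hot vector (a tiny illustration using the same `np_utils` helper we import below):
In [ ]:
from keras.utils import np_utils

# a single row with a 1 at index 3 and 0 elsewhere
print(np_utils.to_categorical([3], 10))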
In [5]:
from keras.datasets import mnist
from keras.utils import np_utils
(X_train, y_train), (X_test, y_test) = mnist.load_data()
In [6]:
X_train.shape
Out[6]:
In [7]:
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
# Scale pixel values to the [0, 1] range
X_train /= 255
X_test /= 255
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)
In [8]:
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train)
In [9]:
X_train[0].shape
Out[9]:
In [10]:
plt.imshow(X_train[0].reshape(28, 28))
Out[10]:
In [11]:
print(np.asarray(range(10)))
print(Y_train[0].astype('int'))
In [12]:
plt.imshow(X_val[0].reshape(28, 28))
Out[12]:
In [13]:
print(np.asarray(range(10)))
print(Y_val[0].astype('int'))
In [14]:
network_history = model.fit(X_train, Y_train, batch_size=128,
epochs=2, verbose=1, validation_data=(X_val, Y_val))
In [15]:
import matplotlib.pyplot as plt
%matplotlib inline
def plot_history(network_history):
plt.figure()
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.plot(network_history.history['loss'])
plt.plot(network_history.history['val_loss'])
plt.legend(['Training', 'Validation'])
plt.figure()
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.plot(network_history.history['acc'])
plt.plot(network_history.history['val_acc'])
plt.legend(['Training', 'Validation'], loc='lower right')
plt.show()
plot_history(network_history)
After 2 epochs, we get a ~88% validation accuracy.
In [21]:
# Your code here
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.001),
metrics=['accuracy'])
network_history = model.fit(X_train, Y_train, batch_size=128,
epochs=2, verbose=1, validation_data=(X_val, Y_val))
A Dropout layer has the very specific function of dropping out a random set of activations in a layer by setting them to zero in the forward pass. Simple as that.
It helps avoid overfitting, but it has to be active only at training time and not at test time.
keras.layers.core.Dropout(rate, noise_shape=None, seed=None)
Applies Dropout to the input.
Dropout consists of randomly setting a fraction `rate` of the input units to 0 at each update during training time, which helps prevent overfitting.
Arguments: `rate` (float between 0 and 1, fraction of the input units to drop), `noise_shape` (shape of the binary dropout mask multiplied with the input), `seed` (integer to use as random seed).
Note: Keras automatically guarantees that this layer is not applied during the inference (i.e. prediction) phase, and is thus only used in training, as it should be!
See the `keras.backend.in_train_phase` function.
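As a minimal NumPy sketch of what the (inverted) dropout forward pass does at training time versus inference time (illustrative only, not the Keras implementation):
In [ ]:
import numpy as np

rate = 0.2                                 # fraction of units to drop
h = np.random.rand(8)                      # some layer activations

mask = np.random.rand(*h.shape) >= rate    # keep each unit with probability 1 - rate
h_train = h * mask / (1.0 - rate)          # training: drop and rescale (inverted dropout)
h_test = h                                 # inference: dropout is a no-op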
In [3]:
from keras.layers.core import Dropout
## Please note **where** `K.in_train_phase` is actually called!
Dropout??
In [2]:
from keras import backend as K
K.in_train_phase?
In [14]:
from keras.layers.core import Dropout
# FC@512+relu -> DropOut(0.2) -> FC@512+relu -> DropOut(0.2) -> FC@nb_classes+softmax
# ... your Code Here
In [ ]:
# %load ../solutions/sol_312.py
In [14]:
network_history = model.fit(X_train, Y_train, batch_size=128,
epochs=4, verbose=1, validation_data=(X_val, Y_val))
plot_history(network_history)
It is always necessary to monitor training and validation loss during the training of any kind of neural network, either to detect overfitting or to evaluate the model's behaviour (any clue on how to do it?)
In [ ]:
# %load solutions/sol23.py
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=4, verbose=1)
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=SGD(),
metrics=['accuracy'])
model.fit(X_train, Y_train, validation_data = (X_test, Y_test), epochs=100,
batch_size=128, verbose=True, callbacks=[early_stop])
In [15]:
# We already used `summary`
model.summary()
In [16]:
print('Model Input Tensors: ', model.input, end='\n\n')
print('Layers - Network Configuration:', end='\n\n')
for layer in model.layers:
print(layer.name, layer.trainable)
print('Layer Configuration:')
print(layer.get_config(), end='\n{}\n'.format('----'*10))
print('Model Output Tensors: ', model.output)
One simple way to do it is to use the weights of your model to build a new model that is truncated at the layer you want to read.
Then you can run the `.predict(X_batch)` method to get the activations for a batch of inputs.
In [17]:
model_truncated = Sequential()
model_truncated.add(Dense(512, activation='relu', input_shape=(784,)))
model_truncated.add(Dropout(0.2))
model_truncated.add(Dense(512, activation='relu'))
for i, layer in enumerate(model_truncated.layers):
layer.set_weights(model.layers[i].get_weights())
model_truncated.compile(loss='categorical_crossentropy', optimizer=SGD(),
metrics=['accuracy'])
In [18]:
# Check
np.all(model_truncated.layers[0].get_weights()[0] == model.layers[0].get_weights()[0])
Out[18]:
In [19]:
hidden_features = model_truncated.predict(X_train)
In [20]:
hidden_features.shape
Out[20]:
In [21]:
X_train.shape
Out[21]:
def get_activations(model, layer, X_batch):
activations_f = K.function([model.layers[0].input, K.learning_phase()], [layer.output,])
activations = activations_f((X_batch, False))
return activations
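For example, assuming the Dropout model defined above (where `model.layers[2]` is the second Dense layer; the index depends on your architecture), you could grab its activations for a small batch like this:
In [ ]:
acts = get_activations(model, model.layers[2], X_train[:10])[0]
print(acts.shape)  # expected: (10, 512) for the 512-unit layer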
In [24]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(hidden_features[:1000]) ## Reduced for computational issues
In [29]:
colors_map = np.argmax(Y_train, axis=1)
In [32]:
X_tsne.shape
Out[32]:
In [49]:
nb_classes
Out[49]:
In [53]:
np.where(colors_map==6)
Out[53]:
In [55]:
colors = np.array([x for x in 'b-g-r-c-m-y-k-purple-coral-lime'.split('-')])
colors_map = colors_map[:1000]
plt.figure(figsize=(10,10))
for cl in range(nb_classes):
indices = np.where(colors_map==cl)
plt.scatter(X_tsne[indices,0], X_tsne[indices, 1], c=colors[cl], label=cl)
plt.legend()
plt.show()
In [67]:
from bokeh.plotting import figure, output_notebook, show
output_notebook()
In [74]:
p = figure(plot_width=600, plot_height=600)
colors = [x for x in 'blue-green-red-cyan-magenta-yellow-black-purple-coral-lime'.split('-')]
colors_map = colors_map[:1000]
for cl in range(nb_classes):
indices = np.where(colors_map==cl)
p.circle(X_tsne[indices, 0].ravel(), X_tsne[indices, 1].ravel(), size=7,
color=colors[cl], alpha=0.4, legend=str(cl))
# show the results
p.legend.location = 'bottom_right'
show(p)
In [75]:
from sklearn.manifold import MDS
In [ ]:
## Your code here
In [ ]:
## Your code here
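One possible sketch for this exercise, mirroring the t-SNE scatter plot above (it assumes `hidden_features`, `colors` and `colors_map` from the previous cells; MDS is slow, so we use only a few hundred points):
In [ ]:
mds = MDS(n_components=2)
X_mds = mds.fit_transform(hidden_features[:500])  # reduced for computational reasons

plt.figure(figsize=(10, 10))
for cl in range(nb_classes):
    indices = np.where(colors_map[:500] == cl)
    plt.scatter(X_mds[indices, 0], X_mds[indices, 1], c=colors[cl], label=cl)
plt.legend()
plt.show()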
In [2]:
## Try using the `get_activations` function relying on keras backend
def get_activations(model, layer, X_batch):
activations_f = K.function([model.layers[0].input, K.learning_phase()], [layer.output,])
activations = activations_f((X_batch, False))
return activations
In [ ]: