Deep learning

  • NN essentials with image and text classification
  • CNNs and RNNs
  • Common NN headaches

Essentials

How is the animal brain capable of deciding on a category from a stream of sensory inputs? What defines a dog in the image of a dog? What in the wording makes you say that this is a happy person pretending to be sad? How many pictures of a dog must a child see in order to correctly label a dog? Deep learning, from an algorithmic perspective, is the application of advanced multi-layered filters that learn hidden features in the data representation. Many of the methods used in DL today, including most neural network types, went through a roughly twenty-year pause because the computing machines available at the time were too slow to produce the wanted results. Several things precipitated their return around 2010:

  • Graphical processors. A GPU has thousands of cores that are specialized in performing linear algebra operations in parallel. This provided the infrastructure on which "deep" algorithms perform best.
  • The maturity of cloud computing. This enables third parties to use DL methodologies at scale, and with small operating costs.
  • Big data. Most AI needs models to be trained on a lot of data, thus AI needs a sufficient level of data availability. The massive accumulation of data (not only in biology) is a very recent phenomenon.

As with everything else in this course, DL is a science in itself, and comes with thousands of pages of theory. For an in-depth reading we recommend this book, written by some of the early specialists in neural networks: http://www.deeplearningbook.org/. It is today possibly the top bestseller in the field of machine learning (also, it is free).

Driven by practicality, as we are for the purpose of this course, we will delve directly into an example of using DL. Python was an early favorite for DL, and initial libraries such as Theano pioneered the approach of defining the problem in Python, then compiling a program that can be executed faster than regular code on either CPU or GPU. In 2015 Google released TensorFlow as an open-source library, and many other DL libraries have mushroomed since. The difficulty of operating such libraries comes from their specialist languages; the Pythonic way to solve this, and open AI to everyone, was to make a library that can drive various DL engines behind a simplified API. Thus Keras came into being.

$ conda install keras

In [2]:
import keras


Using TensorFlow backend.

In [5]:
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical


(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# reshape and scale images
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
# convert labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# defining the NN structure
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

Explanations:

  • layer: the fundamental organizational unit of a deep network; indeed the network is "deep" in the sense of having many layers. Our "deep" network only has two. They are "dense", meaning fully connected: every unit is connected to every unit of the previous layer. Each layer performs a filtering of its input through an activation function; the final softmax activation returns an array of 10 probabilities that sum up to 1 (see the sketch after this list).
  • compilation: this step builds the actual program that will be run. Python does not compile code, so the actual code being generated is not Python. The loss function evaluates the model fit, and the optimizer is the actual fitting routine. Accuracy is simply the fraction of the images that were correctly classified.
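
To make the softmax point concrete, here is a minimal NumPy sketch of the softmax formula itself (not Keras internals; the scores are made-up numbers):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 3.3, -0.7, 1.8, 0.2])
probs = softmax(scores)
print(probs.argmax())  # index of the most likely class
print(probs.sum())     # 1.0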


In [6]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)


Epoch 1/5
60000/60000 [==============================] - 9s 158us/step - loss: 0.2572 - acc: 0.9254
Epoch 2/5
60000/60000 [==============================] - 9s 147us/step - loss: 0.1026 - acc: 0.9695
Epoch 3/5
60000/60000 [==============================] - 8s 134us/step - loss: 0.0681 - acc: 0.9798
Epoch 4/5
60000/60000 [==============================] - 9s 144us/step - loss: 0.0494 - acc: 0.9850
Epoch 5/5
60000/60000 [==============================] - 9s 145us/step - loss: 0.0371 - acc: 0.9892
Out[6]:
<keras.callbacks.History at 0x7f9efd5bba90>

In [8]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print(test_loss, test_acc)


10000/10000 [==============================] - 1s 99us/step
0.06568861380360322 0.9801

Observations:

  • slightly lower accuracy on the test data compared to the training data (the model overfits the training data)
  • "epochs" are complete passes through the dataset; batches are used because the whole dataset is hard to pass through the network at once
  • most optimizers are based on gradient descent, an algorithm that is very efficient on today's GPUs but only guarantees local optima (a toy sketch follows this list)
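
A toy illustration of the gradient descent principle, in one dimension with a hand-derived gradient (this is not what rmsprop does internally, only the underlying idea):

# minimize f(w) = (w - 3)^2 by repeatedly stepping against the gradient
w = 0.0
learning_rate = 0.1
for step in range(50):
    grad = 2 * (w - 3)            # analytic derivative df/dw
    w = w - learning_rate * grad
print(w)  # converges towards the minimum at w = 3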

Questions:

  • Why do we need several epochs?
  • What is the main computer limitation when it comes to batches?
  • How many epochs are needed, and what is the danger associated with using too many or too few?

Text classification

The purpose is to categorize films into good or bad based on their reviews. The data is vectorized into binary vectors.

Layer activation

What happens during layer activation? Basically, a set of tensor operations is performed. A simplistic way to understand this is as operations on arrays of matrices, with the atomic operation being:

output = relu(dot(W, input) + b)

where the weight matrix W has shape (input_dim (10000), 16) and b is a bias term. In linear algebra terms, this projects the input data onto a 16-dimensional space. More dimensions mean more features, more confusion and more computing cost, BUT also more complex representations. What visual features define a dog?
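
The same atomic operation written out in NumPy, with random weights and only to show the shapes (Keras stores W as (input_dim, units), so the dot product is taken as input @ W):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10000)                        # one vectorized review
W = rng.standard_normal((10000, 16)) * 0.01  # weight matrix, shape (input_dim, 16)
b = np.zeros(16)                             # bias vector, one value per unit

output = np.maximum(0, x @ W + b)            # relu(dot(x, W) + b)
print(output.shape)                          # (16,), the 16-dimensional projection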


In [4]:
import numpy as np
from keras.datasets import imdb
from keras import models
from keras import layers
from keras import optimizers
from keras import losses
from keras import metrics

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print(max([max(sequence) for sequence in train_data]))

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=5,
                    batch_size=512,
                    validation_data=(x_val, y_val))

model.predict(x_test)


9999
Train on 15000 samples, validate on 10000 samples
Epoch 1/5
15000/15000 [==============================] - 4s 266us/step - loss: 0.5090 - binary_accuracy: 0.7814 - val_loss: 0.3795 - val_binary_accuracy: 0.8693
Epoch 2/5
15000/15000 [==============================] - 2s 157us/step - loss: 0.3006 - binary_accuracy: 0.9049 - val_loss: 0.3004 - val_binary_accuracy: 0.8895
Epoch 3/5
15000/15000 [==============================] - 2s 159us/step - loss: 0.2180 - binary_accuracy: 0.9284 - val_loss: 0.3086 - val_binary_accuracy: 0.8718
Epoch 4/5
15000/15000 [==============================] - 2s 160us/step - loss: 0.1751 - binary_accuracy: 0.9435 - val_loss: 0.2836 - val_binary_accuracy: 0.8838
Epoch 5/5
15000/15000 [==============================] - 3s 169us/step - loss: 0.1426 - binary_accuracy: 0.9543 - val_loss: 0.2842 - val_binary_accuracy: 0.8870
Out[4]:
array([[0.22390722],
       [0.9989844 ],
       [0.76128745],
       ...,
       [0.07916231],
       [0.09766082],
       [0.4073096 ]], dtype=float32)

Task:

  • Plot the accuracy and loss on both the training and validation data, using the history.history dictionary (a plotting skeleton is sketched after this list). Use more epochs. What do you notice? How many epochs do you think you need? What if you monitored for 100000 epochs?

  • We were using 2 hidden layers. Try using 1 or 3 hidden layers and see how it affects validation and test accuracy.

  • Try using layers with more or fewer hidden units: 32 units, 64 units...
  • Try using the mse loss function instead of binary_crossentropy.
  • Try using the tanh activation (an activation that was popular in the early days of neural networks) instead of relu.
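
For the first task, a plotting skeleton could look like this (assuming the metric names produced by the compile() call above: 'loss', 'val_loss', 'binary_accuracy', 'val_binary_accuracy'):

import matplotlib.pyplot as plt

epochs = range(1, len(history.history['loss']) + 1)
plt.plot(epochs, history.history['loss'], 'bo-', label='training loss')
plt.plot(epochs, history.history['val_loss'], 'ro-', label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()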

CNN

(Convolutional Neural Networks)

Now that we have seen, in two simple examples, how neural networks work, let's describe the two most famous specialized types: CNNs and RNNs.

A CNN's filtering principle is based on the idea of functional convolution, a mathematical way of comparing two functions by sliding one over the other. The input and output of the Conv2D and MaxPooling2D layers is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as we go deeper into the network. The number of channels is controlled by the first argument passed to the Conv2D layers (e.g. 32 or 64). The next step is to feed the last output tensor (of shape (3, 3, 64)) into a Dense network, flattening it first.

The next layer type in a convolutional network is called max pooling, also known as downsampling or subsampling. The activation maps are fed into a downsampling layer, and like convolutions, this method is applied one patch at a time. Max pooling simply takes the largest value from one patch of an image, places it in a new matrix next to the max values from other patches, and discards the rest of the information contained in the activation maps.
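
A tiny NumPy illustration of 2x2 max pooling on a made-up 4x4 activation map; the reshape trick groups the map into non-overlapping 2x2 patches:

import numpy as np

a = np.array([[1, 3, 2, 0],
              [5, 2, 1, 4],
              [0, 6, 7, 2],
              [3, 1, 2, 8]])

pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max of each 2x2 patch
print(pooled)
# [[5 4]
#  [6 8]]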


In [5]:
from keras.datasets import mnist
from keras.utils import to_categorical
from keras import layers
from keras import models

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)


Epoch 1/5
60000/60000 [==============================] - 79s 1ms/step - loss: 0.1746 - acc: 0.9459
Epoch 2/5
60000/60000 [==============================] - 77s 1ms/step - loss: 0.0473 - acc: 0.9851
Epoch 3/5
60000/60000 [==============================] - 77s 1ms/step - loss: 0.0319 - acc: 0.9902
Epoch 4/5
60000/60000 [==============================] - 79s 1ms/step - loss: 0.0248 - acc: 0.9922
Epoch 5/5
60000/60000 [==============================] - 79s 1ms/step - loss: 0.0187 - acc: 0.9946
Out[5]:
<keras.callbacks.History at 0x7f108fe8d828>

In [6]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(test_loss, test_acc)
model.summary()


10000/10000 [==============================] - 6s 626us/step
0.03373222684136208 0.9902
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 64)                36928     
_________________________________________________________________
dense_11 (Dense)             (None, 10)                650       
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________

The number of parameters seen here is the total of trainable weights associated with the hidden units, plus a number of bias weights: (filter_height * filter_width * input_image_channels + 1) * number_of_filters. It is used as a measure of the complexity of the model. The convolutional network filters the image in sequence, gradually expanding the complexity of hidden features and eliminating noise via the downsampling bottleneck.
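
We can verify the Param # column of the summary by hand using the formula above:

def conv_params(filter_h, filter_w, input_channels, n_filters):
    # one bias per filter, hence the +1
    return (filter_h * filter_w * input_channels + 1) * n_filters

print(conv_params(3, 3, 1, 32))   # 320   -> conv2d_1
print(conv_params(3, 3, 32, 64))  # 18496 -> conv2d_2
print(conv_params(3, 3, 64, 64))  # 36928 -> conv2d_3
# dense layers follow (input_units + 1) * units
print((576 + 1) * 64)             # 36928 -> dense_10 (576 = 3 * 3 * 64, flattened)
print((64 + 1) * 10)              # 650   -> dense_11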

Task:

  • draw accuracy plots per epoch and compare them to those obtained from the dense networks.

RNN

(Recurrent Neural Networks)

These networks pass (loop) the information several times through every node. Such networks are mainly applied for classifying sequential input, and rely on backpropagation of the error to do so. When the information passes through a single time, the network is called feed-forward. Recurrent networks, on the other hand, take as their input not just the current input example, but also what they have perceived previously in time. Thus an RNN uses the concepts of time and memory.

One could, for example, define the activation of the hidden state at time t as output_t = relu(dot(W, input_t) + dot(U, output_(t-1)) + b); such a recurrence is trained by a method called backpropagation through time.
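
Written out in NumPy, with random weights and made-up sizes, only to show the mechanics of the loop:

import numpy as np

timesteps, input_dim, units = 5, 8, 4
rng = np.random.default_rng(0)

inputs = rng.random((timesteps, input_dim))        # one input vector per time step
W = rng.standard_normal((units, input_dim)) * 0.1  # input-to-state weights
U = rng.standard_normal((units, units)) * 0.1      # state-to-state weights
b = np.zeros(units)

output = np.zeros(units)                           # output_0
for x_t in inputs:
    # the same W, U and b are reused at every time step
    output = np.maximum(0, W @ x_t + U @ output + b)
print(output)  # the final state summarizes the whole sequence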

A traditional deep neural network uses different parameters at each layer, while an RNN shares the same parameters across all time steps. The output of each time step does not necessarily need to be kept: while doing sentiment analysis, for example, we do not care about the output after every word.

Some more good stuff:

  • they can be bi-directional
  • they can be deep (multiple layers per time step)
  • RNNs can be combined with CNNs to solve complex problems, from speech or image recognition to machine translation.

In [10]:
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

model = Sequential()
model.add(Embedding(max_features, 32))           # 32-dimensional word embeddings
model.add(SimpleRNN(32, return_sequences=True))  # returns the full sequence of hidden states
model.add(SimpleRNN(32))  # This last layer only returns the last outputs.
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=5,
                    batch_size=128,
                    validation_split=0.2)


Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 500)
input_test shape: (25000, 500)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_7 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_8 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 33        
=================================================================
Total params: 324,193
Trainable params: 324,193
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 56s 3ms/step - loss: 0.6061 - acc: 0.6486 - val_loss: 0.4220 - val_acc: 0.8168
Epoch 2/5
20000/20000 [==============================] - 53s 3ms/step - loss: 0.3853 - acc: 0.8341 - val_loss: 0.3564 - val_acc: 0.8486
Epoch 3/5
20000/20000 [==============================] - 53s 3ms/step - loss: 0.2935 - acc: 0.8830 - val_loss: 0.3734 - val_acc: 0.8326
Epoch 4/5
20000/20000 [==============================] - 53s 3ms/step - loss: 0.2129 - acc: 0.9194 - val_loss: 0.4741 - val_acc: 0.7706
Epoch 5/5
20000/20000 [==============================] - 55s 3ms/step - loss: 0.1427 - acc: 0.9478 - val_loss: 0.5521 - val_acc: 0.8180

Task:

  • Figure out how to use LSTMs to improve the model accuracy. Also figure out what LSTMs are :)

Common NN headaches

Overfitting: Gradient descent is the core of NN fitting. Optimization adjusts a model to perform optimally on the training data, while overfitting quantifies how much worse the trained model performs on data it has never seen before. The tension between fitness and generalization is what defines machine learning.

While the model is being trained it starts out under-fit, but as more features are identified and more cleaning is done, the model starts to become over-fit: validation metrics reach a plateau or start degrading. The extra patterns the model now learns are specific to the training data and do not carry over to the test data. How do we combat this?

  • Add more data. The most successful AI today is trained on an unbelievably large set of labeled data. Somewhere deep underground, billions of minions work all day in their smart homes, driving smart cars, being smart about labeling their photos on FB. Similarly, their machines tag single cell IDs on RNA probes.
  • Regularization. And yet the data is never sufficient... Regularization involves either modulating the quantity of information that your model is allowed to store, or adding constraints on what information it is allowed to store (see the sketch after this list).
  • Hyperparametrization: DL models typically expose many tunable parameters, which means the fit is different when a different combination of parameters is used.
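
As an illustration of the regularization bullet, here is a sketch of the two classic constraints in Keras, an L2 weight penalty and dropout, applied to the IMDB model from earlier (the coefficients 0.001 and 0.5 are common starting values, not tuned ones):

from keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,),
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))  # randomly zero half of the activations during training
model.add(layers.Dense(16, activation='relu',
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))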

Tasks:

  • Reduce the network size in one of the previous NNs, and then increase it a lot. Plot the metrics over a number of epochs and see when overfitting occurs.
