How is the animal brain capable of deciding a category from a sequence of sensory inputs? What defines a dog in the image of a dog? What in the wording will make you say this is a happy person pretending to be sad? How many pictures of a dog must a child see in order to correctly label a dog? Deep learning, from an algorithmic perspective, is the application of advanced multi-layered filters to learn hidden features in data representations. Many of the methods used today in DL, such as most neural network types (and not only those), went through a roughly 20-year pause because the computing machines available at the time were too slow to produce the wanted results. Several things precipitated their return around 2010, chiefly much faster hardware (in particular GPUs) and the availability of very large labelled datasets.
As with everything else in this course, DL is a science in itself and contains thousands of pages of theory. For in-depth reading we recommend this book, written by some of the early specialists in neural networks: http://www.deeplearningbook.org/. It is today possibly the top bestseller in the field of machine learning (also, it is free).
Driven by practicality as we are for the purpose of this course, we will delve directly into an example of using DL. Python was an early favorite for DL, and initial libraries such as Theano took the approach of defining the problem in Python, then compiling a program that executes faster than regular code on either CPU or GPU. In 2015 Google released TensorFlow as an open-source library, and various other DL libraries have mushroomed since. The difficulty of operating such libraries comes from their specialist language; the Pythonic way to solve this and open AI to everyone was to make a library able to use various DL engines while simplifying the API. Thus Keras came into being.
$ conda install keras
In [2]:
import keras
In [5]:
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# reshape and scale images
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
# convert labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
# defining the NN structure
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
Explanations:
Read more:
In [6]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)
Out[6]:
In [8]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print(test_loss, test_acc)
Observations:
Questions:
Reading:
The purpose is to categorize films into good or bad based on their reviews. The data is vectorized into binary form.
layer activation
What happens during layer activation? Basically, a set of tensor operations is performed. A simplistic way to understand this is as operations on arrays of numbers (matrices and higher-dimensional tensors), with the atomic operation being:
output = relu(dot(W, input) + b)
where the weight matrix W has shape (input_dim = 10000, 16) and b is a bias term. In linear algebra terms, this projects the input data onto a 16-dimensional space. More dimensions mean more features, more potential confusion and more computing cost, BUT also more complex representations. What visual features define a dog?
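To make the formula concrete, here is a minimal NumPy sketch of that atomic operation; the shapes and values are made up, and with W of shape (10000, 16) the product is written as dot(input, W) so that the shapes line up:
In [ ]:
import numpy as np

def relu(x):
    return np.maximum(x, 0.)

rng = np.random.default_rng(0)
W = rng.standard_normal((10000, 16)) * 0.01   # weight matrix, shape (input_dim, units)
b = np.zeros(16)                              # one bias per output unit
x = rng.random(10000)                         # a single made-up input vector

output = relu(np.dot(x, W) + b)               # the same operation a Dense(16, activation='relu') layer performs
print(output.shape)                           # (16,)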
In [4]:
import numpy as np
from keras.datasets import imdb
from keras import models
from keras import layers
from keras import optimizers
from keras import losses
from keras import metrics
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print(max([max(sequence) for sequence in train_data]))
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=5,
                    batch_size=512,
                    validation_data=(x_val, y_val))
model.predict(x_test)
Out[4]:
Task:
Plot the training and validation accuracy and loss across epochs, using the history.history dictionary (a starting sketch follows after this list). Use more epochs. What do you notice? How many epochs do you think you need? What if you trained for 100000 epochs?
We used 2 hidden layers. Try using 1 or 3 hidden layers and see how this affects validation and test accuracy.
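As a starting point for the plotting task, here is a minimal sketch assuming matplotlib is installed; history is the object returned by model.fit above, and the exact metric keys ('binary_accuracy' / 'val_binary_accuracy' here) can be checked with history.history.keys():
In [ ]:
import matplotlib.pyplot as plt

hist = history.history                     # dict of per-epoch metrics recorded by fit()
epochs = range(1, len(hist['loss']) + 1)

plt.plot(epochs, hist['loss'], 'bo-', label='training loss')
plt.plot(epochs, hist['val_loss'], 'r^-', label='validation loss')
plt.plot(epochs, hist['binary_accuracy'], 'b--', label='training accuracy')
plt.plot(epochs, hist['val_binary_accuracy'], 'r--', label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()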
Now that we have seen in two simple examples how neural networks work, let's describe the two most famous types, CNNs and RNNs.
A CNN's filtering principle is based on the idea of functional convolution, a mathematical way of comparing two functions by sliding one over the other. The output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as we go deeper into the network, while the number of channels is controlled by the first argument passed to the Conv2D layers (e.g. 32 or 64). The next step is to feed the last output tensor (of shape (3, 3, 64)) into a Dense net, flattening it first. For a nice visual way to see how this works in terms of tensor operations, see here:
The next layer in a convolutional network is called max pooling, downsampling or subsampling. The activation maps are fed into a downsampling layer, and like convolutions, this method is applied one patch at a time. In this case, max pooling simply takes the largest value from one patch of an image, places it in a new matrix next to the max values from other patches, and discards the rest of the information contained in the activation maps.
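To make this concrete, here is a tiny NumPy sketch of 2x2 max pooling with stride 2 on a made-up 4x4 activation map:
In [ ]:
import numpy as np

# Made-up 4x4 activation map (a single channel).
activation = np.array([[1, 3, 2, 0],
                       [4, 6, 1, 1],
                       [0, 2, 5, 7],
                       [1, 1, 3, 2]])

# 2x2 max pooling with stride 2: keep only the largest value from each patch.
pooled = activation.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [2 7]]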
In [5]:
from keras.datasets import mnist
from keras.utils import to_categorical
from keras import layers
from keras import models
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)
Out[5]:
In [6]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(test_loss, test_acc)
model.summary()
The number of parameters seen here is the total number of trainable weights associated with the hidden units, plus the bias weights: for a convolutional layer, (filter_height * filter_width * input_channels + 1) * number_of_filters. It is used as a measure of the complexity of the model. The convolutional network filters the image in a sequence, gradually expanding the complexity of the hidden features and eliminating noise via the downsampling bottleneck.
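As a quick sanity check of this formula, here are the counts for the three Conv2D layers defined above; the printed values should match the corresponding rows of model.summary():
In [ ]:
# (filter_height * filter_width * input_channels + 1) * number_of_filters
conv1 = (3 * 3 * 1 + 1) * 32    # 1 input channel   -> 320
conv2 = (3 * 3 * 32 + 1) * 64   # 32 input channels -> 18496
conv3 = (3 * 3 * 64 + 1) * 64   # 64 input channels -> 36928
print(conv1, conv2, conv3)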
Task:
These networks process (loop) the information several times through every node. Such networks are mainly applied to classifying sequential input and rely on backpropagation of error to do so. When the information passes through only once, the network is called feed-forward. Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time. Thus an RNN uses the concepts of time and memory.
One could, for example, define the activation of the hidden state at time t in this manner: output_t = relu(dot(W, input_t) + dot(U, output_t-1) + b). Training such a looped network relies on a method called backpropagation through time.
A traditional deep neural network uses different parameters at each layer, while an RNN shares the same parameters across all time steps. The output of each time step does not necessarily need to be kept: in sentiment analysis, for example, we do not care about the output after every word, only about the final one.
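Here is a minimal NumPy sketch of this loop, with made-up dimensions, showing how the same W, U and b are reused at every time step (this is what a SimpleRNN layer does internally, up to implementation details):
In [ ]:
import numpy as np

def relu(x):
    return np.maximum(x, 0.)

# Made-up dimensions, purely illustrative.
timesteps, input_dim, state_dim = 5, 8, 4
rng = np.random.default_rng(0)
inputs = rng.random((timesteps, input_dim))              # one sequence of 5 timesteps

W = rng.standard_normal((state_dim, input_dim)) * 0.1    # input-to-state weights
U = rng.standard_normal((state_dim, state_dim)) * 0.1    # state-to-state weights
b = np.zeros(state_dim)                                  # bias

state = np.zeros(state_dim)                              # the "memory": the previous output
outputs = []
for x_t in inputs:                                       # the same W, U, b are reused at every step
    state = relu(np.dot(W, x_t) + np.dot(U, state) + b)
    outputs.append(state)

print(outputs[-1])                                       # what SimpleRNN returns when return_sequences=False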
Some more good stuff:
In [10]:
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN
max_features = 10000 # number of words to consider as features
maxlen = 500 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
model = Sequential()
model.add(Embedding(max_features, batch_size))
model.add(SimpleRNN(batch_size, return_sequences=True))
model.add(SimpleRNN(batch_size)) # This last layer only returns the last outputs.
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=5,
                    batch_size=128,
                    validation_split=0.2)
Task:
Overfitting: gradient descent is the core of NN fitting. Optimization adjusts the model to perform as well as possible on the training data, while overfitting describes how much worse the trained model performs on data it has never seen before. The tension between fitting and generalization is what defines machine learning.
While the model is being trained it starts out under-fit, but as training continues and more and more patterns are extracted from the training data, it starts to over-fit: the validation metrics reach a "stable state" or start degrading. The patterns the model learns at that point are specific to the training data and do not carry over to the test data. How do we combat this?
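One common remedy, among others such as getting more training data, reducing the network size or adding weight regularization, is to insert dropout layers and to stop training as soon as the validation loss stops improving. A minimal sketch for the IMDB model, reusing the partial_x_train / x_val variables defined earlier:
In [ ]:
from keras import models, layers
from keras.callbacks import EarlyStopping

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))                 # randomly zero half of the activations during training
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

# Stop when the validation loss has not improved for 2 consecutive epochs.
stopper = EarlyStopping(monitor='val_loss', patience=2)
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[stopper])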
Tasks:
In [ ]: