Keras Overview

High-level neural network Python package using TensorFlow as a backend

CPU or GPU versions
CNTK and Theano can be used as the backend instead

Quickly build Neural Networks in a declarative manner

i.e. tell Keras what kind of NN model/layers you want
Supports Convolutional (CNN) and Recurrent (RNN) neural networks (or a combo)
Using TensorFlow directly gives more control over how to build a fine-tuned and customized NN

Models

Two types of implementations for Models:

The Sequential model is the standard way of building a NN by linearly stacking layers
The Model Class can be used to create a more complex NN using the Keras functional API

Layers

Overview

Neural Networks have an input layer, 1 or more hidden layers, and an output layer. Each layer is mathematically doing matrix multiplication of the inputs by its weights and then producing an output to move through the NN. Each layer also has an activation function applied to the output of each node which helps determine how much its inputs contribute to the output. Deep Neural Networks would have many hidden layers and/or many nodes in each one.
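As a rough illustration of what a single layer computes, here is a minimal pure-Python sketch of one fully-connected layer. The function name dense_forward, the toy numbers, and the bias terms are assumptions made for this example (Keras Dense layers include a bias by default, but the text above does not discuss it).

```python
def relu(z):
    """Rectified Linear Unit: pass positive values through, clamp the rest to 0."""
    return max(0.0, z)


def dense_forward(inputs, weights, biases, activation):
    """Compute activation(inputs . weights + biases) for one layer.

    weights[i][j] is the weight from input i to node j.
    """
    outputs = []
    for j in range(len(biases)):
        # Weighted sum of every input feeding node j, plus that node's bias
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(activation(z))
    return outputs


# Hypothetical toy values: 3 inputs feeding a layer of 2 nodes
inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0],
           [0.25, 0.5],
           [-0.5, 0.0]]
biases = [0.1, 0.2]

layer_output = dense_forward(inputs, weights, biases, relu)
print(layer_output)  # [0.0, 0.2]
```

The first node's weighted sum is negative, so ReLU clamps it to 0; the second node's sum is positive and passes through unchanged.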

There are different types of layers which can be added, each one performing a different task for the overall NN architecture. Varying the sizes and types of layers will create specialized NNs. This colorful picture shows what kind of complexity can be created!
[Figure: image not preserved]

Keras Core Layer Usage

Add a forward layer and its activation function with:

model.add(Dense(units=64, input_dim=100))
model.add(Activation('relu'))

Available core layers:

Dense: a standard, fully-connected, dense NN layer
Activation: applies an activation function to the output of a layer (see below). It can also be set through the activation argument of other layers
Dropout: randomly sets a fraction of inputs to 0 during training to help prevent overfitting
Flatten: reshape everything into a 1-D vector
Reshape: change the shape of the data to target_shape
Permute: switch one dimension to another position
RepeatVector: repeats the input n times
Lambda: apply a lambda function to the data (ex: transform all x -> x^2)
ActivityRegularization: adds an L1 and/or L2 penalty, based on the input activity, to the cost function
Masking: skips a timestep in downstream layers if all of its values match a given mask value

Activation Functions

Overview

The Activation Function (AF) is a mathematical function applied to the weighted sum of a node's inputs, which results in a final output "score". Without any limit, a node could output a huge value and completely outweigh any other node with smaller values, and that's bad. A more practical approach is to apply the same AF to each node in a layer and limit the range to something like -1 to 1. Now they're all on the same playing field in terms of contribution to the final output layer. The actual function can be just about anything, from simple to complex:

Step Function: simple and blunt, yes or no, output is 0 or 1. This is easy to understand, but not the best for a NN. This AF says either "yes, my input contributes info to the output" or "no, it does not." There is no maybe, a little bit, or sort of. This is not the best way to gather info from multiple sources because each one likely contributes something to the final result, and we want to compile that info as we move through a deep neural network with many layers.
Linear Function: a straight line, output is proportional to input. This is better, but still bad for three reasons. (1) The output is not bounded and could "blow up" to a huge positive or negative number, which isn't useful. (2) If every layer has a linear AF, then the whole system is linear and there's no point in having multiple layers; once combined, it's really just one linear function. (3) When we get to back-propagation for gradient descent and minimizing the error, the gradient (derivative of the line) is constant and doesn't depend on the input.
Sigmoid Function: 'S' shaped, with output between 0 and 1. This is better: it's not all or nothing like Step, and it is non-linear so multiple layers are more meaningful. Near the middle of the curve, small changes in the input (x) have a large effect on the output (y), that's good! It causes this node to eventually figure out how to classify the output by drifting towards y=0 or y=1. However, towards the ends of the function (very negative or very positive input values) the output changes very little and is nearly flatlined. This means the node is set in its ways and nothing you say can change its mind! That may be good or bad, but this AF is widely used.
Tanh Function: this is a scaled and shifted version of the Sigmoid function, so it has similar characteristics. The differences are that the output ranges from -1 to 1 and the gradients are even steeper.
ReLU Function: this name is used all the time, what is it? The Rectified Linear Unit is linear if x is greater than 0, otherwise it outputs zero. Overall, it IS non-linear and it can speed up deep NN processing by making it less computationally expensive. Meaning, if it outputs a zero, then there is less number crunching at the next layer because its input is zero. However, with a flat-line zero gradient for x < 0, the node will stop responding to variations in input. This causes a dying ReLU neuron, which makes this portion of the NN go passive. To remedy this, the flat line for x < 0 can be given a small slope so the node can gradually recover during training (Leaky ReLU) instead of causing that whole portion of the NN to become unresponsive to change.
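The functions described in the list above can be written out in a few lines each. This is only a sketch for building intuition, not how Keras implements them; in particular, the leaky slope of 0.01 and the choice to treat x = 0 as "yes" in the step function are assumptions made for this example.

```python
import math


def step(x):
    """All or nothing: 1 for non-negative input, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def linear(x):
    """Output equals input; unbounded, with a constant gradient."""
    return x


def sigmoid(x):
    """'S' shaped curve squashing any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Like sigmoid, but the output range is (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Linear for x > 0, zero otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """ReLU with a small slope for x < 0 (slope value is an assumption)."""
    return x if x > 0 else slope * x


# Compare the functions on a few sample inputs
for x in (-5.0, -0.5, 0.0, 0.5, 5.0):
    print(x, step(x), linear(x), round(sigmoid(x), 4),
          round(tanh(x), 4), relu(x), leaky_relu(x))
```

Comparing the rows for x = -5.0 and x = 5.0 shows the flat ends of sigmoid and tanh, while ReLU returns exactly 0 for every negative input and Leaky ReLU keeps a small negative value.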

Output Activations

At the output layer, we want a slightly different result from our activation function. Using the Softmax Function, all of the outputs are normalized into probabilities that add up to 1. This is good for a categorical probability distribution. It is different from a one-hot encoding scheme, which would tell us definitively what an input should be classified as. For example, if the NN has been trained well, it might tell us there is a 90% probability that our input belongs to one class. The input most likely shares some features with other classes, so the other categories get the rest of the probability distribution, but at lower values like 1-2%.
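A minimal sketch of softmax in plain Python may make the normalization concrete. The raw scores below are made-up values for three hypothetical classes, and subtracting the maximum score is a common numerical-stability trick that is not mentioned in the text above.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer for 3 classes
scores = [4.0, 1.0, 0.5]
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest score takes most of the probability, and the remaining classes share what is left.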

Keras Activation Function Usage

Add an AF to each forward layer in 1 of 2 ways:

model.add(Dense(64, activation='tanh'))
model.add(Dense(64)); model.add(Activation('tanh'))

Available activation functions:

elu
selu
softplus
softsign
relu
tanh
sigmoid
hard_sigmoid
linear
softmax

Keras Directory Structure

Directory structure is important when Keras reads images from disk (for example with ImageDataGenerator.flow_from_directory). Each of the 3 subsets of our data needs to be in its own folder:

Train - data used to fit parameters for NN
Validate - data used to tune the NN's hyperparameters and to check for overfitting during training
Test - used to test the final model to see how well the NN generalizes to new data

Within each folder, Keras expects each class to be in its own directory. For example, when doing cats vs. dogs classification, each folder has one sub-directory per output class we are trying to map the input to (dogs + cats = 2). If there were a third animal, we'd have 3 sub-directories under each.

 └───data
     └───dogscats
         ├───train
         │   ├───cats
         │   └───dogs
         ├───valid
         │   ├───cats
         │   └───dogs
         ├───test
         │   ├───cats
         │   └───dogs
         └───sample
             ├───train
             │   ├───cats
             │   └───dogs
             └───valid
                 ├───cats
                 └───dogs

It is a good idea to have a sample/ directory with much smaller train and valid subdirectories in order to quickly test basic code functionality without running a huge batch.
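The tree above can be created with a few lines of standard-library Python. This is only a sketch: the helper name make_dataset_dirs is hypothetical, and the tree is built inside a temporary directory so the example does not touch any real data folder.

```python
import os
import tempfile


def make_dataset_dirs(root, classes, subsets=("train", "valid", "test")):
    """Create one sub-directory per class inside each data subset."""
    for subset in subsets:
        for class_name in classes:
            os.makedirs(os.path.join(root, subset, class_name), exist_ok=True)


# Build the layout under a throwaway temp directory (an assumption for this demo)
root = os.path.join(tempfile.mkdtemp(), "data", "dogscats")
classes = ["cats", "dogs"]

make_dataset_dirs(root, classes)
# The sample/ copy only needs the smaller train and valid subsets
make_dataset_dirs(os.path.join(root, "sample"), classes, subsets=("train", "valid"))

for subset in ("train", "valid", "test"):
    print(subset, sorted(os.listdir(os.path.join(root, subset))))
```

Adding a third class would only require extending the classes list.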

Neural Network Training Process

At the basic level, sample data is fed into the input layer of a NN, the hidden layers apply their weights to their inputs through matrix multiplication to produce outputs, and finally the output layer gives us a probability for each classification category. When training the NN, we know what the output should be, so we can compare the output probability to the true value. During training, the NN is not 100% accurate, and this is why we train on so many samples, so we can learn from our mistakes!

Random Initialization

The layers of the NN are made up of matrices containing weights which have to start at some value. The process of training the neural network will gradually adjust these weights to more accurately map the inputs to the correct outputs in our training set. It turns out the exact starting point isn't that important, as long as the weights are not all identical. We choose to randomly initialize all the weights in the matrices. When we start training, we will initially be far off and the training process will make larger adjustments towards the correct values; as we get closer to the final solution, the adjustments will become smaller and smaller.
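As a sketch of what random initialization looks like, the snippet below fills a weight matrix with small random values. The function name init_weights, the uniform range of +/-0.05, and the fixed seed are assumptions chosen for illustration; Keras applies its own initialization scheme, which is not reproduced here.

```python
import random


def init_weights(n_inputs, n_nodes, scale=0.05, seed=None):
    """Return an n_inputs x n_nodes matrix of small random weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_nodes)]
            for _ in range(n_inputs)]


# Matrix for a layer with 100 inputs and 64 nodes
weights = init_weights(100, 64, seed=0)

print(len(weights), len(weights[0]))
print(weights[0][:3])
```

Because every weight starts at a different value, the nodes of a layer respond differently to the same input and can learn different features.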

Feed Forward

Applying a training example to the input, doing matrix multiplication, and propagating the numbers through the layers of a NN is the "feed forward" part of the process. The NN takes an input, performs some calculations, and gives us an output. Activation functions are applied at each layer which modify the results during this feed forward phase.

Calculate Loss Function

A loss function computes a number for how far off the result is from the truth. If we tell the previous layers how far off the output was, then they can adjust their weights to more closely match; this is known as back-propagation. We could just use subtraction, but that's not great... it leads to small and large errors which can be positive or negative, so they can cancel each other out. It is better to calculate how far off we were by finding the sum of squared errors. Squaring makes every error positive and also penalizes larger errors more. Now that we know what the error is, we can work on reducing it so our model will predict better!
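A small worked example shows why plain subtraction is a poor loss and what squaring fixes. The predicted and target values are made up for illustration.

```python
def sum_of_differences(predicted, target):
    """Naive loss: positive and negative errors can cancel out."""
    return sum(p - t for p, t in zip(predicted, target))


def sum_of_squared_errors(predicted, target):
    """Every error counts as positive, and large errors count more."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


# Hypothetical output for a 3-class problem; the true class is the second one
target = [0.0, 1.0, 0.0]
predicted = [0.5, 0.5, 0.0]

naive_loss = sum_of_differences(predicted, target)
squared_loss = sum_of_squared_errors(predicted, target)
print(naive_loss)    # 0.0
print(squared_loss)  # 0.5
```

The naive loss reports a perfect score of 0 even though two of the three outputs are wrong, while the squared version reports the error.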

Derivative of Error

Working backwards from the output layer towards the input layer, there should be some way to adjust the weights of those hidden, magical layers, so we get a more accurate output value. More math! Since we have a loss function which tells us the total error, we should be able to adjust the weights and then re-calculate the total error. Because we don't want to overcompensate, we will just tweak the values slightly and then check the new error value. If we move the weights down slightly and the error goes up, then that is bad. If we move them up slightly and the error goes down, then we are heading in the right direction!

But which direction (up or down), and how slightly should we do this? Well, since our loss function is probably some mathematically obscure looking shape, we know that it will not be a flat line. This means it will have rising and falling edges; think of a lumpy looking 3-dimensional bowl shape. Since we want to minimize our loss, we want to be at the bottom of the bowl! How do we get there from where we are? Enter the derivative, the rate of change of the loss function with respect to a weight. A positive derivative means that increasing the weight increases the loss, while a negative derivative means that increasing the weight decreases the loss. We want to reduce the loss, so we move each weight in the direction opposite to its derivative. This Gradient Descent process of subtracting (derivative * learning rate) from the weights gives us a lower loss, aka less error!
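The "tweak the weight slightly and check the error" idea can be sketched for a single weight. Everything here is a toy assumption: the bowl-shaped loss (w - 3)^2, the starting weight of 0, the learning rate of 0.1, and the finite-difference estimate of the derivative (real frameworks compute derivatives analytically).

```python
def loss(w):
    """Toy bowl-shaped loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def estimate_derivative(f, w, nudge=1e-6):
    """Estimate the slope of f at w by nudging w slightly up and down."""
    return (f(w + nudge) - f(w - nudge)) / (2.0 * nudge)


w = 0.0
learning_rate = 0.1
starting_loss = loss(w)

for _ in range(50):
    slope = estimate_derivative(loss, w)
    # Move against the slope: a negative slope pushes w up, a positive one down
    w -= learning_rate * slope

print(round(w, 4), loss(w))
```

Each step moves the weight a fraction of the way towards the bottom of the bowl, so the steps shrink as the slope flattens out near the minimum.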

Now that we know which direction to move in order to reduce the error, how much should we move by? We'll use the factor called the learning rate to scale our adjustments. A number too large may overcompensate and make the error shift too far in the opposite direction, bouncing our results all over the place. Instead let's pick a smaller number and then make many small adjustments towards the lowest error. In practice, the learning rate can be tricky, so it's best to try different numbers. Starting with a lower learning rate will work, but can take too long.
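To see the effect of the learning rate, the sketch below runs the same number of gradient descent steps with three different rates. The toy loss (w - 3)^2 and the three rates are assumptions picked to show the slow, reasonable, and overshooting cases.

```python
def loss(w):
    """Toy bowl-shaped loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def loss_derivative(w):
    """Exact derivative of the toy loss."""
    return 2.0 * (w - 3.0)


def final_loss(learning_rate, steps=20, start=0.0):
    """Run gradient descent for a fixed number of steps and report the loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * loss_derivative(w)
    return loss(w)


final_losses = {rate: final_loss(rate) for rate in (0.01, 0.1, 1.1)}
for rate, value in final_losses.items():
    print(rate, value)
```

The smallest rate is still far from the minimum after 20 steps, the middle rate gets close, and the largest rate overshoots further on every step so the loss ends up larger than where it started.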

At this point, we have input one training sample, calculated the loss, and used Gradient Descent to modify the weights. If we input the same training sample again, we would get a better score from our loss function. But we don't want to be really good at just that sample, so we do this process again for more training samples. It may make the weights worse for any one particular sample, but in general the NN will become better at processing ALL samples from the dataset, including new data we don't train on. We could compute the loss over all the samples in the training set before each weight update (Batch Gradient Descent), but that is computationally expensive and slow for really large datasets. Instead, we could do Mini-Batch Gradient Descent, where the weights are recalculated from a smaller subset of the data, but more often. Going even further, Stochastic Gradient Descent (SGD) or "Online Learning" recalculates the weights after just one random sample at a time. In practice, the name SGD is often used for the mini-batch version as well. SGD is generally preferred because even though each step is a smaller, noisier improvement compared to Batch Gradient Descent, we can make the steps much faster and take many more of them in the right direction.
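The difference between the three variants comes down to how many samples feed each weight update. The sketch below fits a single weight to the made-up relationship y = 2x and only changes the batch size. It is a simplification: samples are visited in a fixed order for reproducibility, whereas real SGD picks them at random.

```python
def train(xs, ys, batch_size, epochs=20, learning_rate=0.1):
    """Fit y = w * x with gradient descent; return the weight and update count."""
    w = 0.0
    updates = 0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Average derivative of the squared error over this batch
            grad = sum(2.0 * (w * x - y) * x
                       for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * grad
            updates += 1
    return w, updates


# Toy dataset of 10 samples where the ideal weight is 2
xs = [0.1 * i for i in range(1, 11)]
ys = [2.0 * x for x in xs]

results = {
    "batch": train(xs, ys, batch_size=len(xs)),
    "mini-batch": train(xs, ys, batch_size=5),
    "stochastic": train(xs, ys, batch_size=1),
}
for name, (w, updates) in results.items():
    print(name, round(w, 4), updates)
```

With the same number of epochs, the smaller batches make more updates and end up closer to the ideal weight on this toy problem.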

Backpropagate and Weight Update

After the error has been calculated at the output layer, it needs to be propagated back through the NN to the first hidden layer. Using the chain rule, each layer works out how much its own weights contributed to the error and updates them accordingly. The updates don't go all the way back to the first layer, because that's the input layer, aka our input data, and we aren't changing that. Now the weights at each layer of our NN have been adjusted based on the results of our most recent loss function calculation.
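A tiny network with one input, one hidden node, and one output is enough to sketch the backward pass by hand. The architecture, the sigmoid hidden activation, the squared-error loss, and all numbers are assumptions chosen to keep the chain rule readable.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Feed forward: input -> hidden node (sigmoid) -> linear output."""
    h = sigmoid(w1 * x)
    y = w2 * h
    return h, y


def loss(x, target, w1, w2):
    """Squared error of the network output."""
    _, y = forward(x, w1, w2)
    return (y - target) ** 2


def backward(x, target, w1, w2):
    """Apply the chain rule from the output back to the first hidden layer."""
    h, y = forward(x, w1, w2)
    d_loss_d_y = 2.0 * (y - target)               # output error
    d_loss_d_w2 = d_loss_d_y * h                  # gradient for the output weight
    d_loss_d_h = d_loss_d_y * w2                  # error passed back to the hidden node
    d_loss_d_w1 = d_loss_d_h * h * (1.0 - h) * x  # through the sigmoid to w1
    return d_loss_d_w1, d_loss_d_w2


# Hypothetical sample, starting weights, and learning rate
x, target = 1.5, 1.0
w1, w2 = 0.4, -0.3
learning_rate = 0.1

loss_before = loss(x, target, w1, w2)
grad_w1, grad_w2 = backward(x, target, w1, w2)
new_w1 = w1 - learning_rate * grad_w1
new_w2 = w2 - learning_rate * grad_w2
loss_after = loss(x, target, new_w1, new_w2)

print(loss_before, loss_after)
```

The input x is used in the calculation but never changed; only the two weights are updated, and the loss on this sample drops after one update.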

Convergence

At this point, we are using the Gradient Descent optimizer "Stochastically" or in the "Online Learning" mode. It will lower the error calculated by the loss function and at some point we need to stop and say the results are good enough. We pick a point of Convergence when the loss function stops decreasing by any appreciable amount.
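One simple way to decide that the loss has stopped decreasing "by any appreciable amount" is to compare consecutive epochs against a tolerance. The sketch below applies that idea to the per-epoch loss values printed by model.fit in the example later in this document. The function name and the tolerance of 0.005 are assumptions.

```python
def first_converged_epoch(losses, tolerance):
    """Return the first epoch (numbered from 1) whose improvement is below tolerance."""
    for index in range(1, len(losses)):
        improvement = losses[index - 1] - losses[index]
        if improvement < tolerance:
            return index + 1
    return None  # never converged within the recorded epochs


# Loss per epoch, copied from the model.fit output of the example model
losses = [1.4575, 0.8436, 0.7721, 0.7451, 0.7359,
          0.7255, 0.7197, 0.7164, 0.7131, 0.7113]

stop_epoch = first_converged_epoch(losses, tolerance=0.005)
print(stop_epoch)  # 8
```

A looser tolerance stops earlier with a higher loss, and a stricter one keeps training for longer.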

Training Example

If we have 10,000 training samples which are images, then our GPU may not have enough VRAM to hold them all in memory. In this case, we need to train on the dataset in parts. Let's split the whole batch of training examples into 10 smaller mini-batches of 1,000 images. Each mini-batch will go through the NN (feed-forward) and then weights at each layer will be optimized during back-propagation. Each mini-batch that gets passed through forward and backprop is an iteration. It will take 10 iterations for the NN to train on our whole training set.

Once all the training data has passed through our NN, one epoch has been completed. This means all 10,000 samples have been fed-forward through the NN and then weights at each layer have been optimized during back-propagation via some sort of optimizing function (usually Stochastic Gradient Descent). Now, the weights at each layer in our NN should be pretty good at predicting the outputs. However, all the data can be passed through again, to further refine the weights. This sounds like a great idea ... so we run a second epoch (10 iterations of mini-batches with 1,000 images each). And then a 3rd and a 4th because, why not?! Now our NN is really good at classifying our training dataset!
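The relationship between samples, mini-batches, iterations, and epochs is plain arithmetic. A short sketch, with hypothetical helper names, makes the bookkeeping explicit.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def total_iterations(n_samples, batch_size, epochs):
    """Number of forward/backward passes over the whole training run."""
    return iterations_per_epoch(n_samples, batch_size) * epochs


# The example from the text: 10,000 images in mini-batches of 1,000
print(iterations_per_epoch(10000, 1000))       # 10
print(total_iterations(10000, 1000, epochs=4))  # 40

# A batch size that does not divide evenly leaves a smaller final mini-batch
print(iterations_per_epoch(1000, 32))           # 32
```

Rounding up matters in the last case: 31 full mini-batches of 32 cover 992 samples, and the remaining 8 samples form a 32nd, smaller mini-batch.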

However, it saw our training data so many times that now the training images are the only thing it predicts well. Running new images through our NN and trying to infer what they are does not work out so well. Our model is overfit. It would have been a good idea to set aside some validation images so we could test for overfitting at each epoch. These validation images are used during the training process to find the point when our model has been trained well enough, but not so well that it doesn't generalize to new images.
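Finding the point where the model is "trained well enough" can be sketched as watching the validation loss and stopping once it has not improved for a few epochs. The loss values, the helper name, and the patience of 2 epochs are all made up for illustration. Keras provides an EarlyStopping callback for this purpose, which is not used here.

```python
def best_epoch(val_losses, patience=2):
    """Return the epoch (numbered from 1) with the lowest validation loss,
    giving up after `patience` epochs in a row without improvement."""
    best_loss = float("inf")
    best = 0
    waited = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            best_loss, best, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best


# Hypothetical curves: training loss keeps falling, validation loss turns around
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_losses = [0.95, 0.70, 0.62, 0.64, 0.69, 0.75]

stop_at = best_epoch(val_losses)
print(stop_at)  # 3
```

After epoch 3 the training loss still improves while the validation loss gets worse, which is the signature of overfitting described above.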

Keras Model Execution with TensorFlow

Using TensorFlow as the backend, we are able to take advantage of the GPU (or multiple GPUs) when training the NN. It would be inefficient to only look at 1 image at a time, and practically impossible to look at all images at once depending on the training set size. Instead, we look at a mini-batch amount at a time. We need to load multiple images to the GPU, but not so many that we run out of GPU memory.

Example Sequential Model

Let's create a Sequential model made up of 2 layers:

The 1st layer must have a defined input shape: input_shape=(X,) or input_dim=X
The 2nd layer is the output layer; following layers can infer what shape they should be



In [1]:

from keras.models import Sequential
model = Sequential()

Using TensorFlow backend.



In [2]:

    
from keras.layers import Dense, Activation
model.add(Dense(units=64, input_dim=100))
model.add(Activation('relu'))
model.add(Dense(units=10))
model.add(Activation('softmax'))



In [3]:

    
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])



In [4]:

    
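# Generate dummy data: 1000 samples with 100 features each
import numpy as np
data = np.random.random((1000, 100))
# Random integer labels; randint(2, ...) only produces classes 0 and 1
labels = np.random.randint(2, size=(1000, 1))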



In [5]:

    
from keras.utils import to_categorical
one_hot_labels = to_categorical(labels, num_classes=10) # Convert labels to categorical one-hot encoding



In [6]:

    
model.fit(data, one_hot_labels, epochs=10, batch_size=32)

Epoch 1/10
1000/1000 [==============================] - 1s - loss: 1.4575 - acc: 0.4450     
Epoch 2/10
1000/1000 [==============================] - 0s - loss: 0.8436 - acc: 0.4890     
Epoch 3/10
1000/1000 [==============================] - 0s - loss: 0.7721 - acc: 0.4840     
Epoch 4/10
1000/1000 [==============================] - 0s - loss: 0.7451 - acc: 0.5140     
Epoch 5/10
1000/1000 [==============================] - 0s - loss: 0.7359 - acc: 0.4850     
Epoch 6/10
1000/1000 [==============================] - 0s - loss: 0.7255 - acc: 0.4970     
Epoch 7/10
1000/1000 [==============================] - 0s - loss: 0.7197 - acc: 0.5100     
Epoch 8/10
1000/1000 [==============================] - 0s - loss: 0.7164 - acc: 0.5110     
Epoch 9/10
1000/1000 [==============================] - 0s - loss: 0.7131 - acc: 0.5160     
Epoch 10/10
1000/1000 [==============================] - 0s - loss: 0.7113 - acc: 0.5190

Out[6]:

<keras.callbacks.History at 0x21887847518>



In [7]:

    
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 64)                6464      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                650       
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0         
=================================================================
Total params: 7,114
Trainable params: 7,114
Non-trainable params: 0
_________________________________________________________________
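The Param # column in the summary can be checked by hand. Assuming each Dense layer has one weight per input-node pair plus one bias per node, a short sketch (with a hypothetical helper name) reproduces the numbers above.

```python
def dense_param_count(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units


first_layer = dense_param_count(100, 64)   # input_dim=100, units=64
second_layer = dense_param_count(64, 10)   # 64 inputs from layer 1, units=10

print(first_layer)                 # 6464
print(second_layer)                # 650
print(first_layer + second_layer)  # 7114
```

The Activation layers show 0 parameters because they only transform values and have nothing to learn.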


