High-level neural network Python package using TensorFlow as a backend
Quickly build Neural Networks in a declarative manner
Great References:
Neural Networks have an input layer, one or more hidden layers, and an output layer. Mathematically, each layer multiplies its inputs by a matrix of weights and passes the result on to the next layer of the NN. Each layer also applies an activation function to the output of each node, which helps determine how much that node contributes to the final output. Deep Neural Networks have many hidden layers and/or many nodes in each one.
There are different types of layers which can be added, each one performing a different task in the overall NN architecture. Varying the sizes and types of layers will create specialized NNs. This colorful picture shows what kind of complexity can be created!
Add a forward layer and its activation function with:
The Activation Function (AF) is a mathematical function applied to the weighted sum of a node's inputs, which results in a final output "score". Without it, a node could output an arbitrarily large value and completely outweigh any other node with smaller values, and that's bad. A more practical approach is to apply the same AF to each node and squash the output into a fixed range, something like -1 to 1. Now the nodes are all on the same playing field in terms of contribution to the final output layer. The actual function can be just about anything, from simple to complex:
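To make the "same playing field" idea concrete, here is a minimal sketch of three commonly used activation functions in plain Python. The choice of these three and the sample input values are assumptions for illustration; the original list of functions is not reproduced here.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into the range (-1, 1).
    return math.tanh(x)

def relu(x):
    # Passes positive values through unchanged and clips negatives to 0.
    return max(0.0, x)

# Illustrative inputs: very negative, small, and very positive values.
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, [round(fn(x), 3) for x in (-100.0, -1.0, 0.0, 1.0, 100.0)])
```

Note that sigmoid and tanh keep even extreme inputs inside a fixed range, while relu only bounds the output from below.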
At the output layer, we want a slightly different result from our activation function. Using the Softmax Function, all of the outputs are normalized into probabilities that add up to 1. This is good for a categorical probability distribution. It is different from a one-hot encoding scheme, which would tell us definitively which class an input belongs to. For example, if the NN has been trained well, it should tell us there is a 90% probability that our input belongs to one class. Most likely the input shares some features with other classes, so the other categories get some of the probability distribution, but at lower values like 1-2%.
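As a rough illustration of the normalization described above, the following sketch implements softmax in plain Python. The raw scores are made-up values, and subtracting the largest score before exponentiating is a common numerical-stability trick rather than something the text requires.

```python
import math

def softmax(scores):
    # Subtract the largest score first so math.exp never overflows.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    # Each output is positive and the outputs sum to 1.
    return [e / total for e in exps]

# Hypothetical raw output-layer scores for 4 classes.
raw_scores = [4.0, 1.0, 0.5, 0.2]
probabilities = softmax(raw_scores)
print([round(p, 3) for p in probabilities], round(sum(probabilities), 6))
```

The class with the largest raw score gets most of the probability, while the other classes still receive small, non-zero shares.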
Add an AF to each forward layer in 1 of 2 ways:
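The two ways are presumably the two that Keras offers: a separate `Activation` layer (the style used in the notebook cells further down) or the `activation` argument of `Dense`. The sketch below shows both. The layer size of 64 mirrors the later example, and the input size is left out here on the assumption that Keras can infer it when the model is first called.

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

# Way 1: add the activation as its own layer right after the Dense layer.
model_a = Sequential()
model_a.add(Dense(units=64))
model_a.add(Activation('relu'))

# Way 2: pass the activation by name to the Dense layer itself.
model_b = Sequential()
model_b.add(Dense(units=64, activation='relu'))

print(len(model_a.layers), len(model_b.layers))
```

Both models compute the same thing; the second form is simply more compact, while the first makes the activation visible as a separate step in the layer list.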
Available activation functions:
Directory structure is important in Keras. Each of the 3 subsets of our data (train, valid, and test) needs to be in its own folder.
Within each folder, Keras expects each class to be in its own directory. As an example, when doing cats vs. dogs classification, each folder has one sub-directory per output class we are trying to map the input to (dogs + cats = 2). If there were a third animal, we'd have 3 sub-directories under each.
└───data
    └───dogscats
        ├───train
        │   ├───cats
        │   └───dogs
        ├───valid
        │   ├───cats
        │   └───dogs
        ├───test
        │   ├───cats
        │   └───dogs
        └───sample
            ├───train
            │   ├───cats
            │   └───dogs
            └───valid
                ├───cats
                └───dogs
It is a good idea to have a sample/ directory with much smaller train and valid subdirectories in order to quickly test basic code functionality without running a huge batch.
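One possible way to build such a sample/ directory is to copy a handful of random files per class out of the full dataset. The helper below is a hypothetical sketch using only the standard library; the function name `make_sample_split` and its parameters are made up for illustration and are not part of Keras.

```python
import random
import shutil
from pathlib import Path

def make_sample_split(data_dir, split, per_class, seed=0):
    """Copy up to `per_class` random files from data_dir/<split>/<class>/
    into data_dir/sample/<split>/<class>/ and return how many were copied."""
    data_dir = Path(data_dir)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    copied = 0
    for class_dir in sorted((data_dir / split).iterdir()):
        if not class_dir.is_dir():
            continue
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        chosen = rng.sample(files, min(per_class, len(files)))
        target = data_dir / "sample" / split / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for src in chosen:
            shutil.copy(src, target / src.name)
            copied += 1
    return copied
```

With the layout above, something like `make_sample_split("data/dogscats", "train", per_class=8)` followed by the same call for `"valid"` would populate both sample subdirectories.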
At the basic level, sample data is fed into the input layer of a NN. The hidden layers have weights assigned to each node, which perform matrix multiplication on their inputs to produce outputs. Finally, when we reach the output layer, we get a probability for each classification category. When training the NN, we know what the output should be, so we can compare the output probability to the true value. During training, the NN is not 100% accurate, and this is why we train on so many samples: so we can learn from our mistakes!
The layers of the NN are made up of matrices containing weights, which have to start at some value. The process of training the neural network will gradually adjust these weights to more accurately map the inputs to the correct outputs in our training set. It turns out the exact starting point isn't that important, so we choose to randomly initialize all the weights in the matrices. When we start training, we will initially be far off and the training process will make larger adjustments towards the correct values; as we get closer to the final solution, the adjustments will become smaller and smaller.
Applying a training example to the input, doing matrix multiplication, and propagating the numbers through the layers of a NN is the "feed forward" part of the process. The NN takes an input, performs some calculations, and gives us an output. Activation functions are applied at each layer which modify the results during this feed forward phase.
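The feed-forward pass can be sketched with NumPy in a few lines. The layer sizes (4 inputs, 3 hidden nodes, 2 output classes), the tanh hidden activation, and the sample values are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 3 hidden nodes, 2 output classes.
# The weights are randomly initialized, as they would be before training.
weights_hidden = rng.normal(size=(4, 3))
weights_output = rng.normal(size=(3, 2))

def feed_forward(sample):
    # Hidden layer: matrix multiplication followed by an activation function.
    hidden = np.tanh(sample @ weights_hidden)
    # Output layer: matrix multiplication followed by softmax.
    scores = hidden @ weights_output
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

sample = np.array([0.5, -1.2, 3.0, 0.1])
output = feed_forward(sample)
print(output, output.sum())
```

The result is one probability per output class. With untrained random weights these probabilities carry no meaning yet, which is what training is meant to fix.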
A loss function computes a number for how far off the result is from truth. If we tell the previous layers how far off the output was, then they can adjust their weights to more closely match. This is known as back-propagation. We could just subtract the true value from the output, but that's not great: the errors can be positive or negative, so they can cancel each other out when added together. It is better to calculate how far off we were by finding the sum of squared errors. The squared part will make every error positive and also penalize larger values. Now that we know what the error is, we can work on reducing it so our model will predict better!
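A small worked example shows why plain subtraction is a poor loss and what squaring fixes. The predicted and true values below are made up for illustration.

```python
def sum_squared_error(predicted, actual):
    # Squaring makes every error positive and penalizes large errors more.
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Hypothetical one-hot truth and a NN output for 3 classes.
actual = [1.0, 0.0, 0.0]
predicted = [0.5, 0.25, 0.25]

raw_differences = [p - a for p, a in zip(predicted, actual)]
print(sum(raw_differences))                  # 0.0: the signed errors cancel out
print(sum_squared_error(predicted, actual))  # 0.375
```

The signed differences add up to zero even though the prediction is clearly off, while the sum of squared errors reports a positive loss.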
Working backwards from the output layer towards the input layer, there should be some way to adjust the weights of those hidden, magical layers, so we get a more accurate output value. More math! Since we have a loss function which tells us the total error, we should be able to adjust the weights and then re-calculate the total error. Because we don't want to overcompensate, we will just tweak the values slightly and then check the new error value. If we move the weights down slightly and the error goes up, then that is bad. If we move them up slightly and the error goes down, then we are heading in the right direction!
But which direction (up or down), and how slightly, should we do this? Well, since our loss function is probably some mathematically obscure-looking shape, we know that it will not be a flat line. This means it will have rising and falling edges; think of a lumpy-looking 3-dimensional bowl shape. Since we want to minimize our loss, we want to be at the bottom of the bowl! How do we get there from where we are? Enter the derivative, the rate of change of the loss function with respect to a weight. A positive derivative means that increasing the weight increases the loss, while a negative derivative means that increasing the weight decreases the loss. We want to reduce the loss, so we move each weight in the direction opposite to its derivative. This Gradient Descent process of subtracting (derivative * learning rate) from the weights will give us a lower loss, aka less error!
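Here is a minimal sketch of Gradient Descent on a one-weight "bowl". The loss function (w - 3)^2, the starting weight, and the learning rate are assumptions picked so the bottom of the bowl is known to be at w = 3.

```python
def loss(weight):
    # A simple bowl-shaped loss with its minimum at weight = 3.
    return (weight - 3.0) ** 2

def derivative(weight):
    # Rate of change of the loss with respect to the weight.
    return 2.0 * (weight - 3.0)

learning_rate = 0.1
weight = 0.0
history = [loss(weight)]
for step in range(25):
    # Move against the derivative: downhill on the loss surface.
    weight = weight - learning_rate * derivative(weight)
    history.append(loss(weight))

print(round(weight, 4), round(history[-1], 6))
```

Starting left of the minimum, the derivative is negative, so subtracting it pushes the weight up towards 3 and the loss shrinks at every step.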
Now that we know which direction to move in order to reduce the error, how much should we move by? We'll use a factor called the learning rate to scale our adjustments. A number that is too large may overcompensate and shift the weights too far in the opposite direction, bouncing our results all over the place. Instead, let's pick a smaller number and then make many small adjustments towards the lowest error. In practice, choosing the learning rate can be tricky, so it's best to try different numbers. A very low learning rate will work, but training can take too long.
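The effect of the learning rate can be seen by running the same number of steps with different values. This sketch assumes a toy loss of (w - 3)^2, and the three learning rates are arbitrary picks meant to show "too large", "reasonable", and "too small".

```python
def derivative(weight):
    # Derivative of the toy loss (weight - 3)^2, whose minimum is at 3.
    return 2.0 * (weight - 3.0)

def distance_after(learning_rate, steps=20, start=0.0):
    # Run a fixed number of Gradient Descent steps and report
    # how far the weight still is from the minimum.
    weight = start
    for _ in range(steps):
        weight -= learning_rate * derivative(weight)
    return abs(weight - 3.0)

for lr in (1.1, 0.1, 0.001):
    print(lr, distance_after(lr))
```

With the large rate, each step overshoots and the weight ends up farther away than it started. With the tiny rate, the weight has barely moved after 20 steps. The middle value gets close to the minimum.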
At this point, we have input one training sample, calculated the loss, and used Gradient Descent to modify the weights. If we input the same training sample again, we would get a better score from our loss function. But we don't want to be really good at just that sample, so we repeat this process for more training samples. It may make the weights worse for any one particular sample, but in general the NN will become better at processing ALL samples from the dataset, including new data we don't train on. We could compute the loss over all the samples in the training set before each weight update (Batch Gradient Descent), but that is computationally expensive and slow for really large datasets. Instead, we could do Mini-Batch Gradient Descent, where the weights are recalculated from a smaller subset of the data, but more often. Going even further, Stochastic Gradient Descent (SGD) or "Online Learning" recalculates the weights after every single randomly chosen sample. SGD is generally preferred because, even though each step is a smaller improvement than a Batch Gradient Descent step, we can take those steps much faster and so take many more of them in the right direction.
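The three variants differ only in how many samples feed each weight update. The sketch below makes that explicit with a single `batch_size` parameter on a toy linear model. The data, the model, and the hyperparameters are all invented for illustration, and a real NN would compute its gradients through back-propagation rather than this closed-form expression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 200 samples with 3 features and noise-free linear targets.
inputs = rng.normal(size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
targets = inputs @ true_weights

def train(batch_size, epochs=50, learning_rate=0.05):
    weights = np.zeros(3)
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(inputs))  # visit samples in random order
        for start in range(0, len(inputs), batch_size):
            idx = order[start:start + batch_size]
            error = inputs[idx] @ weights - targets[idx]
            # Gradient of the mean squared error over this batch only.
            gradient = 2.0 * inputs[idx].T @ error / len(idx)
            weights -= learning_rate * gradient
            updates += 1
    return weights, updates

for name, size in [("batch", 200), ("mini-batch", 20), ("stochastic", 1)]:
    weights, updates = train(size)
    print(name, updates, np.round(weights, 3))
```

For the same number of passes over the data, the smaller batch sizes perform far more weight updates, which is the trade-off the paragraph above describes.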
After the weights for the last layer have been updated, the error needs to be propagated back through to the first hidden layer so that the weights of the earlier layers can be updated too. It doesn't go all the way back to the first layer, because that's the input layer, aka our input data, and we aren't changing that. Now the weights at each layer of our NN are optimized based on the results of our most recent loss function calculation.
At this point, we are using the Gradient Descent optimizer "Stochastically" or in the "Online Learning" mode. It will lower the error calculated by the loss function and at some point we need to stop and say the results are good enough. We pick a point of Convergence when the loss function stops decreasing by any appreciable amount.
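One simple way to express "stops decreasing by any appreciable amount" is a tolerance on the improvement between steps. This is a sketch on an assumed toy loss of (w - 3)^2; the function name `train_until_converged` and the tolerance value are arbitrary choices for illustration.

```python
def loss(weight):
    return (weight - 3.0) ** 2

def derivative(weight):
    return 2.0 * (weight - 3.0)

def train_until_converged(weight=0.0, learning_rate=0.1,
                          tolerance=1e-6, max_steps=10_000):
    previous_loss = loss(weight)
    for step in range(1, max_steps + 1):
        weight -= learning_rate * derivative(weight)
        current_loss = loss(weight)
        # Converged: the loss improved by less than the tolerance.
        if previous_loss - current_loss < tolerance:
            return weight, step
        previous_loss = current_loss
    return weight, max_steps

final_weight, steps_taken = train_until_converged()
print(round(final_weight, 4), steps_taken)
```

A looser tolerance stops earlier with a rougher answer, and a tighter one runs longer for diminishing returns.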
If we have 10,000 training samples which are images, then our GPU may not have enough VRAM to hold them all in memory. In this case, we need to train the dataset in different parts. Let's split the whole batch of training examples into 10 smaller mini-batches of 1,000 images. Each mini-batch will go through the NN (feed-forward) and then weights at each layer will be optimized during back-propagation. Each mini-batch that gets passed through forward and backprop is an iteration. It will take 10 iterations for the NN to train on our whole training set.
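The arithmetic above can be checked directly. In this sketch a list of integers stands in for the 10,000 images, which is an assumption made purely to keep the example runnable.

```python
import math

num_samples = 10_000
batch_size = 1_000

# Round up so a final partial mini-batch would still count as an iteration.
iterations = math.ceil(num_samples / batch_size)

# Stand-in for the image dataset: one integer per image.
samples = list(range(num_samples))
mini_batches = [samples[start:start + batch_size]
                for start in range(0, num_samples, batch_size)]

print(iterations, len(mini_batches), len(mini_batches[0]))
```

Each element of `mini_batches` is what would be fed forward and back-propagated in one iteration.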
Once all the training data has passed through our NN, one epoch has been completed. This means all 10,000 samples have been fed-forward through the NN and then weights at each layer have been optimized during back-propagation via some sort of optimizing function (usually Stochastic Gradient Descent). Now, the weights at each layer in our NN should be pretty good at predicting the outputs. However, all the data can be passed through again, to further refine the weights. This sounds like a great idea ... so we run a second epoch (10 iterations of mini-batches with 1,000 images each). And then a 3rd and a 4th because, why not?! Now our NN is really good at classifying our training dataset!
However, it saw our training data so many times that now the training images are the only thing it predicts well. Running new images through our NN and trying to infer what they are does not work out so well. Our model is overfit. It would have been a good idea to set aside some validation images so we could test for overfitting at each epoch. These validation images are used during the training process to find the point when our model has been trained well enough, but not so well that it doesn't generalize to new images.
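A rough sketch of how validation results could pick the stopping point: track the validation loss per epoch and stop once it has not improved for a few epochs. The function name `stop_epoch`, the `patience` parameter, and the loss values are all made up for illustration and are not real training results.

```python
def stop_epoch(validation_losses, patience=2):
    """Return the epoch (1-based) with the best validation loss, stopping
    early once the loss has failed to improve for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, val_loss in enumerate(validation_losses, start=1):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss keeps rising: likely overfitting
    return best_epoch

# Made-up validation losses: improving at first, then getting worse.
made_up_losses = [0.90, 0.60, 0.45, 0.40, 0.43, 0.50, 0.62]
print(stop_epoch(made_up_losses))  # 4
```

The training loss would typically keep falling past epoch 4 in this scenario, and only the validation loss reveals that the model has stopped generalizing.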
Using TensorFlow as the backend, we are able to take advantage of the GPU (or multiple GPUs) when training the NN. It would be inefficient to only look at 1 image at a time, and practically impossible to look at all images at once, depending on the training set size. Instead, we look at one mini-batch at a time. We need to load multiple images to the GPU, but not so many that we run out of GPU memory.
Let's create a Sequential model made up of 2 layers
In [1]:
from keras.models import Sequential
model = Sequential()
In [2]:
from keras.layers import Dense, Activation
model.add(Dense(units=64, input_dim=100))
model.add(Activation('relu'))
model.add(Dense(units=10))
model.add(Activation('softmax'))
In [3]:
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
In [4]:
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(10, size=(1000, 1))  # integer class labels 0-9, matching the 10 output units
In [5]:
from keras.utils import to_categorical
# Convert labels to categorical one-hot encoding
one_hot_labels = to_categorical(labels, num_classes=10)
In [6]:
model.fit(data, one_hot_labels, epochs=10, batch_size=32)
In [7]:
model.summary()