We discussed the importance of learning rates and the need to anneal the learning rate when using Stochastic Gradient Descent (SGD). SGD is a convex optimization technique and converges to a stationary point, but a stationary point is not necessarily the minimum for every problem. A solution space with sparse gradients will contain a type of stationary point called a saddle point: a region where the gradient is very low or near zero for a few iterations and high after that.
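To make this concrete, here is a small NumPy sketch (illustrative only; NumPy and this helper are not part of the tutorial's code) of the classic saddle surface $f(x, y) = x^2 - y^2$, whose gradient vanishes at the origin even though the origin is not a minimum.
import numpy as np

def saddle_grad(theta):
    # gradient of f(x, y) = x^2 - y^2, which has a saddle point at the origin
    x, y = theta
    return np.array([2 * x, -2 * y])

print(saddle_grad(np.array([0.0, 0.0])))  # ~[0, 0]: near-zero gradient, yet not a minimum
print(saddle_grad(np.array([0.5, 0.5])))  # [1, -1]: the gradient grows again away from the saddle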
This problem can be mitigated by updating the parameters with a historical average of the gradients instead of the current gradient alone. The momentum update is calculated by taking a portion of the previous momentum and subtracting the current gradient (scaled by the learning rate) from it. The equations for the regular (Polyak) momentum update are shown below. $$v_{t+1} = \mu v_t - \eta f'(\theta _t)$$ $$\theta _{t+1} = \theta _t + v_{t+1}$$
$v_t$ is the momentum from the previous time step, $f'(\theta _t)$ is the gradient at the current parameters $\theta _t$, $\eta$ is the learning rate, and $\mu$ is the momentum coefficient. If you consider the previous momentum as one vector and the current gradient as another, we are adjusting the direction of the momentum vector using the current gradient.
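A minimal NumPy sketch of this update rule, written here only for illustration (it is not YANN's internal implementation), could look like this:
import numpy as np

def polyak_momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    # v_{t+1} = mu * v_t - lr * f'(theta_t);  theta_{t+1} = theta_t + v_{t+1}
    v_new = mu * v - lr * grad_fn(theta)
    return theta + v_new, v_new

# toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = polyak_momentum_step(theta, v, lambda t: 2 * t)
print(theta)  # oscillates towards the minimum at 0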
Nesterov Accelerated Gradient (Nesterov momentum) is another widely used type of momentum update. In Polyak momentum we calculate the gradient at the current parameters and then adjust the momentum with it. In Nesterov momentum we instead evaluate the gradient at the look-ahead point $\theta _t + \mu v_t$ before updating the momentum. Nesterov momentum can be represented mathematically using the following equations: $$v_{t+1} = \mu v_t - \eta f'(\theta _t + \mu v_t)$$ $$\theta _{t+1} = \theta _t + v_{t+1}$$
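Compared to the Polyak sketch above, the only change is where the gradient is evaluated (again, an illustrative sketch rather than YANN's implementation):
import numpy as np

def nesterov_momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    # the gradient is evaluated at the look-ahead point theta_t + mu * v_t
    v_new = mu * v - lr * grad_fn(theta + mu * v)
    return theta + v_new, v_new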
YANN supports Polyak and Nesterov momentum. Momentum can be added to the network using the following optimizer parameters:
optimizer_params = {
    "momentum_type"   : <option: 'false' (no momentum), 'polyak', 'nesterov'>,
    "momentum_params" : (<value in [0,1]>, <value in [0,1]>, <int>),
    "optimizer_type"  : <option: 'sgd', 'adagrad', 'rmsprop', 'adam'>,
    "id"              : <id of the optimizer>
}
momentum_type
is the type of momentum (polyak/nesterov) to be used. If you do not want any momentum, set it to 'false'; the default value of momentum_type is 'false'.
momentum_params
is a 3-tuple giving the momentum coefficient at the start, the momentum coefficient at the end, and the epoch at which the momentum stops increasing (see the sketch after this list); the default value is (0.5, 0.95, 50).
optimizer_type
is the type of optimizer to be used. YANN supports sgd, adagrad, rmsprop, and adam. We discussed sgd in the previous tutorial and will discuss the other optimizers later in this chapter.
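For intuition, the three values of momentum_params can be read as a ramp on the momentum coefficient. The helper below is purely hypothetical and assumes a simple linear schedule; YANN's actual internal schedule may differ.
def momentum_at_epoch(epoch, params=(0.5, 0.95, 50)):
    # hypothetical linear ramp from the start coefficient to the end coefficient,
    # reaching the end value at end_epoch; YANN's real schedule may differ
    start, end, end_epoch = params
    if epoch >= end_epoch:
        return end
    return start + (end - start) * epoch / end_epoch

print(momentum_at_epoch(0))   # 0.5
print(momentum_at_epoch(25))  # 0.725
print(momentum_at_epoch(60))  # 0.95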
Let's use the MLP network we created in the previous tutorial, train it with no momentum, Polyak momentum, and Nesterov momentum, and analyze the results.
In [1]:
from yann.network import network
from yann.special.datasets import cook_mnist
import matplotlib.pyplot as plt
def get_cost():
    # read the training costs logged to ./resultor/costs.txt
    costs = []
    with open('./resultor/costs.txt') as costf:
        costs = [float(cost.rstrip()) for cost in costf]
    return costs

def plot_costs(costs, labels):
    # overlay the cost curves of the different runs for comparison
    for cost, label in zip(costs, labels):
        plt.plot(cost, label=label)
    plt.legend()
    plt.show()
costs = []
data = cook_mnist()
dataset_params = { "dataset": data.dataset_location(), "id": 'mnist', "n_classes" : 10 }
def mlp(dataset_params, optimizer_params, optimizer_id):
    net = network()
    net.add_layer(type = "input", id = "input", dataset_init_args = dataset_params)
    net.add_layer(type = "dot_product",
                  origin = "input",
                  id = "dot_product_1",
                  num_neurons = 800,
                  regularize = True,
                  activation = 'relu')
    net.add_layer(type = "dot_product",
                  origin = "dot_product_1",
                  id = "dot_product_2",
                  num_neurons = 800,
                  regularize = True,
                  activation = 'relu')
    net.add_layer(type = "classifier",
                  id = "softmax",
                  origin = "dot_product_2",
                  num_classes = 10,
                  activation = 'softmax',
                  )
    net.add_layer(type = "objective",
                  id = "nll",
                  origin = "softmax",
                  )
    net.add_module(type = 'optimizer', params = optimizer_params)
    learning_rates = (0.05, 0.01, 0.001)
    net.cook(verbose = 0,
             optimizer = optimizer_id,
             objective_layer = 'nll',
             datastream = 'mnist',
             classifier = 'softmax',
             )
    net.train(verbose = 0,
              epochs = (20, 20),
              validate_after_epochs = 2,
              training_accuracy = True,
              learning_rates = learning_rates,
              show_progress = True,
              early_terminate = True)
    return net
The above function returns a trained network given dataset_params and optimizer_params. Let's first train this network without any momentum.
In [2]:
optimizer_params = {
"momentum_type" : 'false',
"momentum_params" : (0.9, 0.95, 30),
"regularization" : (0.0001, 0.0002),
"optimizer_type" : 'sgd',
"id" : 'sgd'
}
net = mlp(dataset_params, optimizer_params, 'sgd')
costs.append(get_cost())
In [3]:
labels = ['no momentum']
plot_costs(costs, labels)
Let's train the same network with Polyak momentum.
In [4]:
optimizer_params = {
"momentum_type" : 'polyak',
"momentum_params" : (0.9, 0.95, 30),
"regularization" : (0.0001, 0.0002),
"optimizer_type" : 'sgd',
"id" : 'sgd-polyak'
}
net = mlp(dataset_params, optimizer_params, 'sgd-polyak')
costs.append(get_cost())
In [5]:
labels = ['no momentum', 'polyak']
plot_costs(costs, labels)
Let's train the same network with Nesterov momentum.
In [6]:
optimizer_params = {
"momentum_type" : 'nesterov',
"momentum_params" : (0.9, 0.95, 30),
"regularization" : (0.0001, 0.0002),
"optimizer_type" : 'sgd',
"id" : 'sgd-nesterov'
}
net = mlp(dataset_params, optimizer_params, 'sgd-nesterov')
costs.append(get_cost())
In [ ]:
labels = ['no momentum', 'polyak', 'nesterov']
plot_costs(costs, labels)
If you look at the plot above, you can observe that Polyak and Nesterov momentum helped the network converge faster than training with no momentum. In this particular example, Polyak and Nesterov performed similarly. However, in general Nesterov performs better and is more stable.
Momentum addresses the saddle point problem by adding a temporal average of the previous updates. One problem with momentum is that it accelerates the updates in every direction (dimension). Imagine a scenario where the minimum lies in a direction with small gradients while the gradients are high in the other directions. Momentum speeds up movement in the other directions while maintaining the slow pace in the direction we actually need to move.
Adagrad is another optimizer borrowed from convex optimization; it guarantees convergence in a convex setting. In the optimization techniques we have seen so far, we used the same learning rate for all parameters. Adagrad adapts the learning rate per parameter. The parameter update in Adagrad is done using the following equation. $$\theta _{j} = \theta _j - \frac{\eta}{\sqrt{G_{j,j}}} g_j$$
The above equation has only one new component, $\sqrt{G_{j,j}}$, where $G_{j,j}$ is the sum of the squares of the previous gradients for parameter $j$. This term acts as a smoothing factor: if the gradient is consistently high in a direction, then $G_{j,j}$ is also high, which reduces the effective learning rate in that direction, and vice versa.
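A minimal NumPy sketch of the per-parameter Adagrad update (illustrative only, not YANN's code; a small epsilon is added here just to avoid division by zero):
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    # G accumulates the squared gradients per parameter (the diagonal entries G_{j,j})
    G = G + grad ** 2
    theta = theta - lr / (np.sqrt(G) + eps) * grad
    return theta, G

# toy usage: parameters with very different gradient scales still make progress
theta, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([2.0, 0.02]) * theta   # one steep direction, one shallow direction
    theta, G = adagrad_step(theta, G, grad, lr=0.5)
print(theta)  # both coordinates decrease towards 0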
Let's try the previous example using Adagrad.
In [8]:
optimizer_params = {
"momentum_type" : 'false',
"regularization" : (0.0001, 0.0002),
"optimizer_type" : 'adagrad',
"id" : 'adagrad'
}
net = mlp(dataset_params, optimizer_params, 'adagrad')
costs.append(get_cost())
In [9]:
labels = ['sgd','sgd-polyak','sgd-nesterov', 'adagrad']
plot_costs(costs, labels)
The above plot shows a comparison between SGD (with and without momentum) and Adagrad: Adagrad converged significantly faster than the other algorithms. You can also use Polyak and Nesterov momentum with Adagrad.
We saw that Adagrad gets all those nice properties from the denominator term $\sqrt{G_{j,j}}$. However, $G_{j,j}$ only grows with time, so the effective learning rate shrinks with time. It can reach a point where, even though the gradient is high, the tiny learning rate stops the learning. In simple terms, Adagrad forces convergence. Therefore this technique is not immune to the saddle point problem. Let's look at one of the most popular neural network training techniques.
The rmsprop algorithm, proposed by Hinton, solves Adagrad's problem of forced convergence by taking a weighted average of the historical gradients, so that the current update is affected only by the past few gradients rather than by every past gradient as in Adagrad. $$H_t = \gamma H_{t-1} + (1-\gamma )g_j^2$$ $$\theta _{j} = \theta _j - \frac{\eta}{\sqrt{H_t}} g_j$$ In the above equations $\gamma$ is called the forgetting factor because it scales down $H_{t-1}$, the historical gradient term. Since the historical gradient is scaled down by this forgetting factor, rmsprop avoids Adagrad's forced convergence and remains immune to saddle points. Because rmsprop uses a weighted historical average, it is also not affected much by sudden changes in the gradient, so it works well in online and non-stationary optimization settings.
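A minimal NumPy sketch of this update (illustrative only, not YANN's implementation), using a typical forgetting factor of 0.9:
import numpy as np

def rmsprop_step(theta, H, grad, lr=0.001, gamma=0.9, eps=1e-8):
    # H_t = gamma * H_{t-1} + (1 - gamma) * grad^2: a decaying average of squared gradients
    H = gamma * H + (1.0 - gamma) * grad ** 2
    theta = theta - lr / (np.sqrt(H) + eps) * grad
    return theta, H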
Let's see rmsprop in action:
In [10]:
optimizer_params = {
"momentum_type" : 'false',
"regularization" : (0.0001, 0.0002),
"optimizer_type" : 'rmsprop',
"id" : 'rmsprop'
}
net = mlp(dataset_params, optimizer_params, 'rmsprop')
costs.append(get_cost())
In [11]:
labels = ['sgd','sgd-polyak','sgd-nesterov', 'adagrad','rmsprop']
plot_costs(costs, labels)
The above plot shows that rmsprop reaches convergence marginally faster than Adagrad.
We have rmsprop, which works well in online and non-stationary optimization settings, and we have Adagrad, which works well with sparse gradients. What if a single algorithm could achieve both? We will learn about such an optimizer in the next section.
Adam is a relatively new algorithm, published in 2015. Adam handles both sparse gradients and non-stationary objectives, it has automatic annealing, and it adapts the learning rate for each parameter. Let's look at the mathematics behind Adam.
Moment calculation: $$m_t = \beta_1 m_{t-1} + (1-\beta_1 )g_t$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2 )g_t^2$$
Bias correction: $$\hat{m_t} = \frac{m_t}{1-\beta_1^t}$$ $$\hat{v_t} = \frac{v_t}{1-\beta_2^t}$$
Parameter update: $$\theta _{t+1} = \theta _t - \frac{\eta}{\sqrt{\hat{v_t}} + \epsilon } \hat{m_t}$$
As shown above, Adam has three steps: moment calculation, bias correction, and parameter update. In the moment calculation step, Adam computes first-order and second-order moments. The first-order moment is a temporally weighted average of the historical gradients, and the second-order moment is a temporally weighted average of the squares of the historical gradients. $\beta _1$ and $\beta _2$ are coefficients similar to the forgetting factor in rmsprop and are close to 1; the authors of the technique suggest 0.9 for $\beta _1$ and 0.999 for $\beta _2$. Therefore $\beta _1$ and $\beta _2$ bias $m_t$ and $v_t$ towards the historical average, because the weights $1-\beta_1$ and $1-\beta_2$ on the current gradient are close to zero. At the start of learning, $m_0 = 0$ and $v_0 = 0$, which biases the moment estimates towards zero; the bias correction terms compensate for this. Since $\beta _1 < 1$, after a few iterations $\beta_1^t \approx 0$ and the bias correction no longer affects the moments. The parameter update is equivalent to rmsprop with Polyak momentum, because the first-order moment plays the role of Polyak momentum and the second-order moment is the same as the denominator term in rmsprop. $\epsilon$ is a small constant used to prevent division by zero.
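Putting the three steps together in a short NumPy sketch (illustrative only, not YANN's implementation; the defaults beta1=0.9 and beta2=0.999 follow the suggestions above):
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # moment calculation
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # bias correction (t is the 1-based iteration count)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # parameter update
    theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v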
Let's see Adam in action:
In [13]:
optimizer_params = {
"momentum_type" : 'false',
"regularization" : (0.0001, 0.0002),
"optimizer_type" : 'adam',
"id" : 'adam'
}
net = mlp(dataset_params, optimizer_params, 'adam')
costs.append(get_cost())
In [14]:
labels = ['sgd','sgd-polyak','sgd-nesterov', 'adagrad','rmsprop','adam']
plot_costs(costs, labels)
The above plot shows that Adam performs slightly better than rmsprop.
To conclude this discussion of optimizers: this tutorial introduces the optimizers in a particular order, but that does not mean Adam works better than the other techniques in every scenario. There are many cases where Adagrad may outperform rmsprop or Adam. It is advisable to analyze these techniques and choose one based on the requirements of the problem.