In [ ]:
# L2 regularization is a classic method to reduce over-fitting, and consists of adding 
# to the loss the sum of the squares of all the weights of the model, multiplied by a 
# given hyper-parameter (all eqns in this article use python, numpy, and pytorch notation):
final_loss = loss + wd * all_weights.pow(2).sum() / 2
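
In [ ]:
# A minimal runnable sketch of the formula above (the tiny linear model, the random 
# data and the value of wd are made up purely for illustration):
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
wd = 1e-2

loss = F.mse_loss(model(x), y)
sum_sq_weights = sum(p.pow(2).sum() for p in model.parameters())
final_loss = loss + wd * sum_sq_weights / 2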

In [ ]:
# ...where wd is the hyper-parameter to set. This is also called weight decay, because 
# when applying vanilla SGD it's equivalent to updating the weight like this:
w = w - lr * w.grad - lr * wd * w
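
In [ ]:
# A quick numerical check of that equivalence (the values of w, lr, wd and the stand-in 
# data gradient below are made up purely for illustration):
import torch

lr, wd = 0.1, 0.01
grad = torch.tensor([0.5, 0.3])                 # stand-in for the gradient of the data loss
w = torch.tensor([1.0, -2.0], requires_grad=True)

final_loss = (w * grad).sum() + wd * w.pow(2).sum() / 2   # a linear loss plus the L2 penalty
final_loss.backward()                                     # now w.grad == grad + wd*w

with torch.no_grad():
    w_l2 = w - lr * w.grad                      # SGD step on the regularized loss
    w_decay = w - lr * grad - lr * wd * w       # vanilla SGD step plus explicit decay
print(torch.allclose(w_l2, w_decay))            # True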

In [ ]:
# Note that the derivative of w**2 with respect to w is 2w, which is why the loss above 
# is divided by 2. In this eqn we see how we subtract a little portion of the weight at 
# each step, hence the name 'decay'.

# Let's look at SGD with momentum for instance. Using L2 regularization consists of 
# adding wd*w to the gradients (as we saw earlier) but the gradients aren't 
# subtracted from the weights directly. First we compute a moving average:
moving_avg = alpha * moving_avg + (1-alpha) * (w.grad + wd*w)

In [ ]:
# ...and it's this moving average that'll be multiplied by the learning rate and 
# subtracted from w. So the part linked to the regularization that'll be taken from 
# w is lr*(1-alpha)*wd*w plus a combination of the previous weights that were already 
# in moving_avg.
# 
# On the other hand, weight decay's update will look like:
moving_avg = alpha * moving_avg + (1-alpha) * w.grad
w = w - lr*moving_avg - lr*wd*w
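
In [ ]:
# A sketch contrasting the two formulations over a few steps (the constant stand-in 
# gradient and the values of lr, wd and alpha are made up purely for illustration):
import torch

lr, wd, alpha = 0.1, 0.01, 0.9
grad = torch.tensor([0.5, 0.3])          # pretend this is w.grad at every step
w_l2 = torch.tensor([1.0, -2.0])         # weights trained with L2 folded into the gradient
w_wd = w_l2.clone()                      # weights trained with decoupled weight decay
ma_l2 = torch.zeros(2)
ma_wd = torch.zeros(2)

for _ in range(10):
    ma_l2 = alpha * ma_l2 + (1 - alpha) * (grad + wd * w_l2)
    w_l2 = w_l2 - lr * ma_l2
    ma_wd = alpha * ma_wd + (1 - alpha) * grad
    w_wd = w_wd - lr * ma_wd - lr * wd * w_wd

print(w_l2, w_wd)                        # the two schemes end up with different weights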

In [ ]:
# adam (bias correction and epsilon omitted)
avg_grads = beta1 * avg_grads + (1-beta1) * w.grad
avg_squared = beta2 * (avg_squared) + (1-beta2) * (w.grad ** 2)
w = w - lr * avg_grads / sqrt(avg_squared)

In [ ]:
# amsgrad
avg_grads = beta1 * avg_grads + (1-beta1) * w.grad
avg_squared = beta2 * (avg_squared) + (1-beta2) * (w.grad ** 2)
max_squared = max(avg_squared, max_squared)
w = w - lr * avg_grads / sqrt(max_squared)
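
In [ ]:
# A runnable one-step sketch of the adam and amsgrad updates above (lr, the betas, the 
# gradient and the running averages are made-up values; a small eps is added inside the 
# square root to avoid dividing by zero):
import torch

lr, beta1, beta2, eps = 0.01, 0.9, 0.99, 1e-8
w = torch.tensor([1.0, -2.0])
grad = torch.tensor([0.5, 0.3])                    # pretend this is w.grad
avg_grads = torch.tensor([0.4, 0.2])               # made-up running averages
avg_squared = torch.tensor([0.20, 0.05])
max_squared = torch.tensor([0.30, 0.02])           # past maximum kept by amsgrad

avg_grads = beta1 * avg_grads + (1 - beta1) * grad
avg_squared = beta2 * avg_squared + (1 - beta2) * grad ** 2

w_adam = w - lr * avg_grads / (avg_squared + eps).sqrt()
max_squared = torch.maximum(avg_squared, max_squared)
w_amsgrad = w - lr * avg_grads / (max_squared + eps).sqrt()
print(w_adam, w_amsgrad)                           # amsgrad divides by the larger denominator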

In [ ]: