Regression functions predict a quantity, and classification functions predict a label.
Supervised learning - { input, correct output }
Unsupervised learning - { input } - contains only inputs without correct outputs
Reinforcement learning - { input, some output, grade for this output }
Offline learning
Online learning
Supervised learning - Regression and Classification
Unsupervised learning - Clustering, Dimension Reduction, Anomaly Detection
Reinforcement learning - Markov decision process, Q-learning, Temporal Difference methods, Monte-Carlo methods
Supervised learning is usually divided into two types of learning. Logistic regression is not a regression algorithm but a classification algorithm. However, it is considered a regression model in statistics, and the machine learning community adopted it and began using it as a classifier.
Multi-Class Classification
One-of-many classification. Each sample can belong to ONLY ONE of $C$ classes. The NN will have $C$ output neurons that can be gathered in a vector s (Scores). The target (ground truth) vector $t$ will be a one-hot vector with a positive class and $C−1$ negative classes.
This task is treated as a single classification problem of samples in one of $C$ classes.
Multi-Label Classification
Each sample can belong to more than one class. The NN will have as well $C$ output neurons. The target vector $t$ can have more than a positive class, so it will be a vector of 0s and 1s with $C$ dimensionality.
This task is treated as $C$ different binary and independent classification problems, where each output neuron decides if a sample belongs to a class or not.
Multi-class means you choose from a number of mutually exclusive classes. A good example is dog breed classification: a dog cannot be of two breeds at the same time.
The planet competition in the course, by contrast, is an example of multi-label classification. One image can have 2 or more labels at the same time. It is perfectly OK for a single photograph to contain both “Agriculture” and “River” while it is “Cloudy”.
To summarize:
multi-class: a sample can be one of N classes where N>2 (e.g. ImageNet1k)
For multi-class (one ground truth label), softmax + cross entropy loss works well. Use F.cross_entropy or nn.CrossEntropyLoss
multi-label: a sample can be labeled with more than one class (e.g. video classification)
For multi-label (multiple ground truth labels), sigmoid + cross entropy per label works well. Use F.binary_cross_entropy_with_logits or nn.BCEWithLogitsLoss (see the sketch below)
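As a minimal sketch of the two setups (the shapes and target values below are hypothetical, purely for illustration):
In [ ]:
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                      # 4 samples, 3 classes, raw scores

# Multi-class: one ground-truth class index per sample -> softmax + cross entropy
targets_mc = torch.tensor([0, 2, 1, 2])
loss_mc = F.cross_entropy(logits, targets_mc)   # applies log-softmax internally

# Multi-label: a 0/1 vector per sample -> sigmoid + cross entropy per label
targets_ml = torch.tensor([[1., 0., 1.],
                           [0., 1., 0.],
                           [1., 1., 0.],
                           [0., 0., 1.]])
loss_ml = F.binary_cross_entropy_with_logits(logits, targets_ml)  # applies sigmoid internally

print(loss_mc.item(), loss_ml.item())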
Family of Texture Metrics (Feature Descriptors from Classical Approaches)
The above feature descriptors are manually extracted and fed into a feature representation. They are then passed to a traditional machine learning classifier.
An artificial neural network, on the other hand, is a fully connected network where, for example in the case of images, each pixel is fed in as a feature.
Manually extracted features can also be fed into a fully connected neural network.
Family of Deep Neural Networks:
Fully Connected Networks
-Autoencoders
-Belief networks
Convolutional Networks
-Conv-Nets
-LeNet, GoogLeNet, AlexNet
-U-Nets
-Res-Nets
Recurrent Neural Networks
-Long Short Term Memory (LSTMs)
Typically a machine learning setup looks as below:
In mathematics, three interrelated functions, exponential, logistic, and logarithmic, are commonly used in a wide variety of applications. In deep learning, we tend to use these functions in one way or another.
Exponential functions model growth and decay over time, such as unrestricted population growth and the decay of radioactive substances.
Logistic functions model restricted population growth, certain chemical reactions, and the spread of rumors and diseases.
Logarithmic functions are the basis of the Richter scale of earthquake intensity, the pH acidity scale, and the decibel measurement of sound.
There are also algebraic functions, comprising polynomial functions, rational functions, and power functions with rational exponents.
Let $a$ and $b$ be real number constants. An exponential function in $x$ is a function that can be written in the form
$$f(x) = a \cdot b^x$$where $a$ is nonzero, $b$ is positive, and $b \neq 1$ . The constant $a$ is the initial value of $f$ (the value at $x=0$), and $b$ is the base.
Any exponential function can be expressed in terms of the natural base $e$.
We are usually more interested in the exponential function $f(x) = e^x$ and variations of this function than in the irrational number $e$ itself.
Any exponential function $$f(x) = a \cdot b^x$$ can be re-expressed in $e$ as $$f(x) = a \cdot e^{kx}\\$$
If $a>0$ and $k>0$, $f(x)=a\cdot e^{kx}$ is an exponential growth function.
If $a>0$ and $k<0$, $f(x)=a\cdot e^{kx}$ is an exponential decay function.
Exponential growth is unrestricted. An exponential growth function increases at an ever-increasing rate and is not bounded above. In many growth situations, however, there is a limit to the possible growth.
A plant can only grow so tall.
The number of goldfish in an aquarium is limited by the size of the aquarium. In such situations the growth often begins in an exponential manner, but the growth eventually slows and the graph levels out. The associated growth function is bounded both below and above by horizontal asymptotes.
Let $a,b,c,$ and $k$ be positive constants, with $b < 1$.
A logistic growth function in $x$ is a function that can be written in the form
$$f(x) = \frac{c}{1+a \cdot b^x} = \frac{c}{1+a \cdot e^{-kx}}$$where the constant $c$ is the limit to growth.
For $b>1$ or $k<0$, these formulas yield logistic decay functions.
By setting $a=c=k=1$, we obtain the logistic function
$$ f(x)= \frac{1}{1+e^{-x}}\\$$The inverse of the function $f(x)=b^x$ is the logarithmic function with base b, denoted as $\log_b(x)$ or simply as $\log_bx\\$.
If $f(x)=b^x$ with $b>0$ and $b \neq 1$ then $f^{-1} (x)=\log_b x$
Logarithms with base 10 are called common logarithms. Because of their connection to the base-ten number system, the metric system, and scientific notation, common logarithms are especially useful. We often drop the subscript 10 for the base when using common logarithms.
Hence the common logarithmic function is $\log_{10}x = \log x$, and it is the inverse of the exponential function $f(x)=10^x$
Because of their special calculus properties, logarithms with the natural base $e$ are used in many situations. Logarithms with base $e$ are natural logarithms. We often use the special abbreviation "ln" (without a subscript) to denote a natural logarithm.
Thus, the natural logarithmic function is $\log_{e}x = \ln x$, and it is the inverse of the exponential function $f(x)=e^x$
Properties of Logarithms:
Let $b$, $R$, and $S$ be positive real numbers with $b \neq 1$, and $c$ any real number.
Product rule: $$ \log_b (RS) = \log_b R + \log_b S$$
$$ \log (2 \cdot 4) = \log 2 + \log 4$$
Quotient rule: $$ \log_b \frac{R}{S} = \log_b R - \log_b S$$
$$ \log \frac{8}{2} = \log 8 - \log 2$$
Power rule: $$ \log_b R^c = c \log_b R$$
$$ \log 2^3 = 3 \log 2\\$$
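The three rules above can be checked numerically; a quick sanity check with base-10 logarithms (the values are arbitrary):
In [ ]:
import numpy as np

# Verify the product, quotient, and power rules with common (base-10) logarithms
print(np.isclose(np.log10(2 * 4), np.log10(2) + np.log10(4)))   # product rule
print(np.isclose(np.log10(8 / 2), np.log10(8) - np.log10(2)))   # quotient rule
print(np.isclose(np.log10(2 ** 3), 3 * np.log10(2)))            # power rule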
Note: Data pairs $(x, y)$ that fit a power model have a linear relationship when re-expressed as $(\ln x, \ln y)$ pairs. Similarly, data pairs $(x, y)$ that fit a logarithmic or exponential regression model can also be linearized through logarithmic re-expression. Thus the following are regression models related by logarithmic re-expression
An appropriate model can be chosen based on the data.
An activation function defines the output of a neuron given an input. Based on the output of the activation function, the neuron decides whether to pass the information on or not.
Golden rule: Activation functions solve the open range problem (unrestricted growth problem) of inputs and weights.
The prediction for, say, categorical problems may require true or false, or yes or no; the output cannot be just any value between $-\infty$ and $+\infty$ but must lie in a certain range. So we will need to map it down to some sort of zero-to-one problem.
An activation function, or transfer function, is some sort of curve that takes whatever sum the neuron produces and converts it to some number within the range of the transfer function, say between -1 and +1 or between 0 and 1.
One simple way of doing the above is to use a threshold: if the value is > 0 then make it one, and if the value is < 0 then make it zero.
Step function is a simple activation function which returns either 0 or 1. It represents whether the neuron is firing or not.
$Range: \{0, 1\}$
$$f(x)= \begin{cases} \begin{align} 0 \ &: \ x_{i} \leq T\\ 1 \ &: \ x_{i} > T\\ \end{align} \end{cases}$$Consider a prediction to watch a movie which is based on the model rating $x$ with mid-level of 0.5.
Hence
$$Watch\;a\;Movie (x) = \begin{cases} \begin{align} No \ &: \ x \leq 0.5\\ Yes \ &: \ x > 0.5 \\ \end{align} \end{cases}$$So if the model tells that a particular movie rating is 0.51 we can watch it and a rating of 0.49 means that we should not watch it as the threshold is at 0.5
However, notice that the two values are very close to each other; the 0.49 rating seems harsh because it will make users skip the movie over just a 0.02 difference.
This abrupt jump is caused by the step activation function. It has a sharp decision boundary.
The step function is discontinuous and therefore non-differentiable (its derivative is the Dirac delta function). Therefore this function is mostly not used in practice with back-propagation.
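Even though the step function is rarely trained with back-propagation, it is easy to plot; here is a small numpy sketch (the threshold T = 0.5 and the 0.49/0.51 ratings are just the illustrative values from the movie example above):
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def step(x, T=0.0):
    # 1 if the input exceeds the threshold T, else 0
    return (x > T).astype(float)

x = np.linspace(-10, 10, 100)
plt.plot(x, step(x), label='step')
plt.legend(loc='upper left')
plt.show()

# The abrupt jump from the movie example around T = 0.5
print(step(np.array([0.49, 0.51]), T=0.5))   # [0. 1.]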
Sigmoid functions are a family of much smoother functions whose curve is similar to the shape of an “S” and which avoid the sharp decision boundary.
The logistic sigmoid function maps the resulting values into the range between 0 and 1 and is of the form:
Logistic function: $f(x)=\frac {1} {1+e^{-x}}$
When x tends to $\infty$, the sigmoid function becomes 1
$e^{−\infty}$ can be written as $\frac {1} {e^{\infty}} = \frac {1} {\infty} \approx 0$ as it tends towards a very small number and hence tends to zero.
clearly $(\frac {1} {1+0})$ is equal to 1.
When x tends to $-\infty$, the sigmoid function becomes 0.
$e^{\infty}$ becomes $\infty$ as $e$ is increasing at a very high rate and hence it is tending towards a very large number
clearly $(\frac {1} {1+\infty})$ is equal to 0 as essentially, 1 divided by a very big number gets very close to zero
When x tends to $0$ sigmoid tends to 0.5
Notice that $e^{-0}=e^0=1$, since negating zero leaves it unchanged.
clearly $(\frac {1} {1+1}) = 0.5$
Interestingly, probability is another quantity that lies between 0 and 1, hence we can also interpret the values given by a sigmoid function as probabilities. Thus we can think in terms of the probability of liking a movie, as in the previous case.
Sigmoid is mostly used for binary classification problems i.e. outputs values with $Range: (0, 1)$
Thus:
(a) Linear Regression can be understood as an approximation model $$\hat y = w^T x $$ (b) Logistic Regression can be understood as an approximation model $$ \hat y = \frac {1} {1+e^{-(w^T x)}}$$
In [5]:
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
y = [(1/(1 + np.exp(-i))) * (1 - (1 /(1 + np.exp(-i)))) for i in x]
return y
x = np.linspace(-10, 10, 100)
print(x)
y1 = sigmoid(x)
y3 = sigmoid_derivative(x)
plt.plot(x, y1, label='sigmoid')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
Vanishing Gradients - Derivative is zero at both the ends of the curve. Thus during backpropagation through the network with sigmoid activation, the gradients in neurons whose output is near 0 or 1 are nearly 0. These neurons are called saturated neurons. Thus, the weights in these neurons do not update.
Not only that, the weights of neurons connected to such neurons are also slowly updated. This problem is also known as vanishing gradient.
Zero-centered - The sigmoid outputs are not zero-centered. The output is always between 0 and 1, which means the output after applying the sigmoid is always positive. Consequently, the gradients on the weights $w$ during backpropagation will either all be positive or all be negative (depending on the gradient of the whole expression $f$). This can introduce undesirable zig-zagging dynamics in the gradient updates for the weights: if the gradients all share the same sign, all the weights must either increase or decrease together over one iteration, so depending on the step length, if you overshoot in the + direction, all weights will have to adjust in the - direction in the next step.
Computationally expensive - The exp() function is computationally expensive compared with the other non-linear activation functions
The problem with non-zero centered output is that since the sign of gradient update for all neurons is the same, all the weights of the layer can either increase or decrease during one update.
However, the ideal gradient weight update might be one where some weights increase while the other weights decrease.
Suppose some weights need to decrease according to the ideal weight update. However, if the gradient update is positive, these weights become too positive in the current iteration. In the next iteration, the gradient may be large and negative to remedy these increased weights, which might end up overshooting the weights that need only a small negative change or even a positive change.
This can cause a zigzag pattern in the search for the minimum, which can slow down training.
Tanh or hyperbolic tangent is similar to sigmoid function. It also has shape similar to “S” but its range is from -1 to 1. The advantage of Tanh over Sigmoid function is that the zero inputs will be mapped around zero and negative inputs will be strongly negative i.e, its outputs are zero-centered and therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity.
Hyperbolic tangent (shifted and scaled version of the logistic function): $f(x)=\tanh x={\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}$
$Range: (-1, 1)$
You can think of a tanh function as two sigmoids put together. In practice, tanh is preferable over sigmoid. Negative inputs are mapped strongly negative, zero inputs are mapped near zero, and positive inputs are mapped positive.
The tanh function also suffers from the vanishing gradient problem and therefore kills gradients when saturated however the gradient is stronger for tanh than sigmoid (derivatives are steeper).
In [7]:
import numpy as np
import matplotlib.pyplot as plt
def tanh(x):
return np.tanh(x)
def tanh_derivative(x):
return 1.0 - np.tanh(x)**2
x = np.linspace(-10, 10, 100)
print(x)
y1 = tanh(x)
y3 = tanh_derivative(x)
plt.plot(x, y1, label='tanh')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
ReLU (Rectified Linear Unit) is currently the most used activation function, since it works well for convolutional neural networks and for deep neural networks in general. It computes the function $$f(x) = \max(0,x)$$
$$\begin{split}f(x) = \begin{Bmatrix} 0 & x < 0 \\ x & x >= 0 \end{Bmatrix}\end{split}$$Basically, if the input is less than 0, the output is 0. And if the input is greater than 0, the output equals the input. This activation makes the network converge much faster.
If we're in the region less than zero, the derivative is zero, but if we're in the region greater than zero, the derivative is basically 1.
The derivative here is basically the slope. If we look at the ReLU output for an input of 0.1 it is 0.1, and for an input of 2 it is 2; so in the positive region it has a slope of 1, hence that plot.
It does not saturate which means it is resistant to the vanishing gradient problem at least in the positive region ( when x > 0).
The issue with ReLU is that all negative values become zero, which may decrease the ability of model to train properly.
Thus ReLU can suffer from the “dying ReLU” problem:
A ReLU neuron is “dead” if it is stuck on the negative side and always outputs 0. Because the slope of ReLU in the negative range is also 0, once a neuron goes negative, it is unlikely to recover. Such neurons play no role in discriminating the input and are essentially useless. Over time we may end up with a large part of the network doing nothing.
In [8]:
import numpy as np
import matplotlib.pyplot as plt
def relu(x, Derivative=False):
if not Derivative:
return np.maximum(0,x)
else:
out = np.ones(x.shape)
out[(x < 0)]=0
return out
x = np.linspace(-10, 10, 100)
print(x)
y1 = relu(x)
y3 = relu(x, Derivative=True)
plt.plot(x, y1, label='relu')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
Leaky ReLU attempts to fix the problems with ReLU.
$$f(x) = max(0.01x,x)$$ $$\begin{split}f(x) = \begin{Bmatrix} 0.01x & x < 0 \\ x & x >= 0 \end{Bmatrix}\end{split}$$Instead of returning $0$ for $x < 0$, it will have a small negative value, for example $0.01x$
However, because it is still (piecewise) linear in the negative region, it can lag behind sigmoid and tanh for some complex classification use cases.
$Range: (-\infty, +\infty)$
In [9]:
import numpy as np
import matplotlib.pyplot as plt
def Lrelu(x, Derivative=False, epsilon=0.1):
if not Derivative:
return np.maximum(epsilon * x, x)
else:
gradients = 1. * (x > 0)
gradients[gradients == 0] = epsilon
return gradients
x = np.linspace(-10, 10, 100)
print(x)
y1 = Lrelu(x)
y3 = Lrelu(x, Derivative=True)
plt.plot(x, y1, label='leaky relu')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
The idea of leaky ReLU can be extended even further. Instead of multiplying $x$ by a fixed constant, we can multiply it by a parameter that is learned during training, which seems to work better than leaky ReLU. This extension of leaky ReLU is known as Parametric ReLU.
$$ f(x) = \max(\alpha x, x)$$$$\begin{split}f(x) = \begin{Bmatrix} \alpha x & x < 0 \\ x & x >= 0 \end{Bmatrix}\end{split}$$Where $\alpha$ is a learnable parameter. The idea here is to introduce a parameter $\alpha$ that can be learned, since you can backpropagate into it. This gives the neurons the ability to choose what slope is best in the negative region, and with this ability, they can become a ReLU or a leaky ReLU.
$Range: (-\infty, +\infty)$
In RReLU, the slopes of negative parts are randomized in a given range in the training, and then fixed in the testing.
Applies the randomized leaky rectified linear unit function, element-wise,
$$\begin{split}f(x) = \begin{Bmatrix} ax & x < 0 \\ x & x >= 0 \end{Bmatrix}\end{split}$$where $a$ is randomly sampled from uniform distribution $\;\mathcal{U}(\text{lower}, \text{upper})$
For Parametric ReLU, $\alpha$ is learned, and for Leaky ReLU, $\alpha$ is fixed. For RReLU, $a$ is a random variable sampled from a given range during training, and it remains fixed during testing.
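PyTorch provides modules for all three variants; a minimal usage sketch (the slopes and bounds shown are the library defaults, written out explicitly for clarity):
In [ ]:
import torch
import torch.nn as nn

x = torch.linspace(-5, 5, 11)

leaky = nn.LeakyReLU(negative_slope=0.01)   # fixed slope in the negative region
prelu = nn.PReLU(num_parameters=1)          # learnable slope, initialized to 0.25
rrelu = nn.RReLU(lower=1/8, upper=1/3)      # slope sampled uniformly while training

print(leaky(x))
print(prelu(x))
print(rrelu(x))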
Similar to leaky ReLU, ELU has a small slope for negative values. Instead of a straight line, it uses an exponential curve.
It tends to converge the cost to zero faster and produce more accurate results. ELU has an extra alpha constant (a hyperparameter which has to be tuned) that should be a positive number.
ELU is very similar to ReLU except for negative inputs: both are identity functions for non-negative inputs. For negative inputs, ELU smoothly approaches $-\alpha$, whereas ReLU has a sharp kink at zero.
$$\begin{split}ELU(z) = \begin{Bmatrix} z & z > 0 \\ \alpha (e^z - 1) & z <= 0 \end{Bmatrix}\end{split}$$And its derivative is given by:
$$\begin{split}ELU'(z) = \begin{Bmatrix} 1 & z>0 \\ \alpha e^z & z<0 \end{Bmatrix}\end{split}$$For z > 0, it can blow up the activation with the output range of [0, inf].
$Range: (-\alpha, +\infty)$
It is designed to combine the good parts of ReLU and leaky ReLU — while it doesn’t have the dying ReLU problem, it saturates for large negative values, allowing them to be essentially inactive.
In [10]:
import numpy as np
import matplotlib.pyplot as plt
def elu(z,Derivative=False, alpha=1):
if not Derivative:
return np.where(z < 0, alpha * (np.exp(z) - 1), z)
else:
out = np.where(z < 0, alpha*(np.exp(z)), z)
out[(z > 0)]=1
return out
x = np.linspace(-10, 10, 100)
print(x)
y1 = elu(x)
y3 = elu(x, Derivative=True)
plt.plot(x, y1, label='elu')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
Scaled Exponential Linear Unit (SELU)
$$\begin{split}SELU(x) = \lambda \begin{Bmatrix} x & x \ge 0 \\ \alpha (e^x - 1) & otherwise \end{Bmatrix}\end{split}$$$$Range: (-\lambda \alpha, +\infty)$$This activation function was introduced in the article “Self-Normalizing Neural Networks” (2017, from Sepp Hochreiter's group). It is one of the few examples where the authors prove properties of their activation function: they show that on average there is no vanishing (or exploding) gradient problem (for specific “magic” values of $\alpha$ and $\lambda$).
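A minimal numpy sketch of SELU, using the constants reported in the paper (PyTorch also ships this as torch.nn.SELU):
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# "magic numbers" from the Self-Normalizing Neural Networks paper
alpha_ = 1.6732632423543772
lambda_ = 1.0507009873554805

def selu(x):
    return lambda_ * np.where(x >= 0, x, alpha_ * (np.exp(x) - 1))

x = np.linspace(-10, 10, 100)
plt.plot(x, selu(x), label='selu')
plt.legend(loc='upper left')
plt.show()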
S-shaped Rectified Linear Activation Unit (SReLU)
$Range: (-\infty, +\infty)$
CReLU (Concatenated ReLU)
Concatenated ReLU has two outputs, one ReLU and one negated ReLU, concatenated together. In other words, for positive x it produces [x, 0], and for negative x it produces [0, -x]. Because it has two outputs, CReLU doubles the output dimension.
ReLU6
ReLU capped at 6.
It was first used in this paper for CIFAR-10, and 6 is an arbitrary choice that worked well. According to the authors, the upper bound encouraged their model to learn sparse features earlier.
CELU
The Exponential Linear Unit (ELU) is not continuously differentiable with respect to its input when the shape parameter alpha is not equal to 1. CELU is an alternative parametrization which is C1 continuous for all values of alpha, making the rectifier easier to reason about and making alpha easier to tune. This alternative parametrization has several other useful properties that the original parametrization of ELU does not.
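ReLU6 and CELU are available directly in PyTorch; CReLU has no built-in module but can be written as a concatenation of two ReLUs. A small sketch:
In [ ]:
import torch
import torch.nn as nn

x = torch.linspace(-8, 8, 9)

relu6 = nn.ReLU6()          # min(max(0, x), 6)
celu = nn.CELU(alpha=1.0)   # max(0, x) + min(0, alpha * (exp(x / alpha) - 1))
print(relu6(x))
print(celu(x))

# CReLU: concatenate ReLU(x) and ReLU(-x), doubling the output dimension
def crelu(x, dim=-1):
    return torch.cat([torch.relu(x), torch.relu(-x)], dim=dim)

print(crelu(x))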
Generally, we summarize the advantages and potential problems of ReLUs:
There are also some other types of units that do not have the functional form $f(w^T x+b)$, where a non-linearity is applied to the dot product between the weights and the data. One relatively popular choice is the Maxout neuron, which generalizes the ReLU and its leaky version.
The Maxout neuron computes the function $$\max(w^T_1 x+b_1,\; w^T_2 x+b_2)$$
Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have $w_1,b_1=0$).
$$\max(0,\; w^T_2 x+b_2)$$
The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
Read paper from Ian Goodfellow et al. titled "Maxout Networks".
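A minimal sketch of a single 2-piece maxout unit (the weights here are random, purely for illustration); setting $w_1, b_1 = 0$ recovers a ReLU:
In [ ]:
import torch

torch.manual_seed(0)
x = torch.randn(4)                       # 4 input features
w1, b1 = torch.randn(4), torch.randn(1)  # piece 1
w2, b2 = torch.randn(4), torch.randn(1)  # piece 2

# element-wise max of the two affine pieces
maxout = torch.max(w1 @ x + b1, w2 @ x + b2)
print(maxout)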
SoftPlus: The derivative of the softplus function is the logistic function. ReLU and Softplus are largely similar, except near 0 (zero), where the softplus is enticingly smooth and differentiable. It is much easier and more efficient to compute ReLU and its derivative than the softplus function, which has log(.) and exp(.) in its formulation.
$$f(x) = ln(1+e^x)$$$$Range: (0, \infty)$$Derivative of the softplus function is the logistic function.
$$ f'(x) = \frac {1}{1+e^{-x}}$$
In [11]:
def safe_softplus(x):
# Use this function to avoid overflow
inRanges = (x < 100)
return np.log(1 + np.exp(x*inRanges))*inRanges + x*(1-inRanges)
def softplus(x, Derivative=False):
if not Derivative:
return np.log(1 + np.exp(x))
else:
return 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 100)
print(x)
y1 = softplus(x)
y3 = softplus(x,True)
plt.plot(x, y1, label='softplus')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
The Gaussian function is an even function, thus it gives the same output for equally positive and negative values of the input. It gives its maximal output when there is no input and has decreasing output with increasing distance from zero. We can perhaps imagine this function being used in a node where the input feature is less likely to contribute to the final result.
$f\left( x_{i}\right ) = e^{ -x_{i}^{2}}, \quad f^{\prime}\left( x_{i}\right ) = - 2x_{i} e^{ - x_{i}^{2}}$
In [12]:
def gaussian(x, Derivative=False):
if not Derivative:
return np.exp(-x**2)
else:
return -2 * x * np.exp(-x**2)
x = np.linspace(-10, 10, 100)
print(x)
y1 = gaussian(x)
y3 = gaussian(x, Derivative=True)
plt.plot(x, y1, label='gaussian')
plt.plot(x, y3, label='derivative')
plt.legend(loc='upper left')
plt.show()
While the sigmoid function is used for binary classification in the logistic regression model, softmax is used for multi-class classification in the logistic regression model.
Softmax converts raw scores (distances) into probabilities, normalizing and smoothing them.
It calculates the probability distribution of the event over ‘n’ different events. In other words, this function calculates the probability of each target class over all possible target classes. The calculated probabilities then help determine the target class for the given inputs.
Thus softmax functions convert a raw value into a posterior probability. This provides a measure of certainty. It squashes the outputs of each unit to be between 0 and 1.
The output of the softmax function is equivalent to a categorical probability distribution, it tells you the probability that any of the classes are true.
Softmax is extremely good at picking a single label, while the sigmoid function simply differentiates between two classes. So, when you train a separate classifier for each label you may end up with positive values for a few different labels.
Thus softmax() helps when we want a probability distribution, which sums up to 1. Sigmoid is used when you want the output to be ranging from 0 to 1, but need not sum to 1.
And if we wish to classify and choose between two mutually exclusive alternatives, it is recommended to use softmax(), as that gives a probability distribution on which we can apply the cross entropy loss function.
In general given the number of training examples with class labels, we convert the integer class coding into a one-hot representation and softmax is used on the weighted scores to get the probabilities.
Two key differences though between softmax and sigmoid are
Digging deeper, we can also use sigmoid for multi-class classification. When you use a softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood) whose sum is bound to be one.
And increasing the output value of one class makes the others go down (they sum to 1). So sigmoids can be preferred over softmax when the outputs are independent of one another. To put it more simply, if there are multiple classes and each input can belong to exactly one class, then it absolutely makes sense to use softmax; in the other cases, sigmoid seems better.
Let’s say, we have three classes {class-1, class-2, class-3} and scores of an item for each class is [1, 7, 2].
Hardmax assigns the probability [0, 1, 0], whereas softmax assigns a soft distribution such as [0.1, 0.7, 0.2] (illustrative values; the actual softmax of these scores is computed in the sketch below). Hence, softmax predicts softly (with probability 0.7) that the item belongs to class-2, whereas hardmax predicts hard (with probability 1) that the item belongs to class-2.
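A quick numpy check of hardmax versus the actual softmax of the scores [1, 7, 2] (note the real softmax is far more peaked than the illustrative [0.1, 0.7, 0.2]):
In [ ]:
import numpy as np

scores = np.array([1.0, 7.0, 2.0])

softmax = np.exp(scores) / np.exp(scores).sum()
hardmax = np.zeros_like(scores)
hardmax[np.argmax(scores)] = 1.0

print(np.round(softmax, 4))   # ~[0.0025 0.9909 0.0067]
print(hardmax)                # [0. 1. 0.]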
In binary classification the sigmoid and softmax formulations coincide, whereas in multi-class classification we use the softmax function.
Hence use sigmoid for binary classification and softmax for multiclass classification. If the number of classes is 2, then softmax is the same as the sigmoid function.
In [13]:
import numpy as np
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0)  # sum over classes along axis=0, so each column forms a distribution
scores = [3.0, 1.0, 0.2]
print(softmax(scores),"\n")
scores2D = np.array([[1, 2, 3, 6],
[2, 4, 5, 6],
[3, 8, 7, 6]])
print(softmax(scores2D))
Adaptive Log Softmax With Loss is an approximate strategy for training models with large output spaces. It is most effective when the label distribution is highly imbalanced, for example in natural language modelling, where the word frequency distribution approximately follows Zipf’s law.
LogSigmoid applies element-wise $$Log\;Sigmoid (x) = \log \left(\frac{1}{1 + \exp(-x_i)}\right)$$
LogSoftmax applies a softmax followed by a logarithm. While mathematically equivalent to $log(softmax(x))$, doing these two operations separately is slower, and numerically unstable. This function uses an alternative formulation to compute the output and gradient correctly.
Softmin applies $$Softmin(x)=Softmax(−x)$$
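A small demonstration of these functional forms (the input values are hypothetical):
In [ ]:
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0]])

print(F.log_softmax(x, dim=1))          # numerically stable log(softmax(x))
print(torch.log(F.softmax(x, dim=1)))   # same values, computed in two steps
print(F.softmin(x, dim=1))              # softmax(-x): smaller scores get larger weight
print(F.logsigmoid(x))                  # log(1 / (1 + exp(-x))), element-wise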
Activations are just scalar values. Activation functions thus typically share the following properties:
Nonlinear means that the output of the function varies nonlinearly with the input. Non-linear functions have degree more than one and they have a curvature when we plot a Nonlinear function. They introduce non-linear properties to the Network.
Continuity of the functions implies that there are no sharp peaks or gaps in the function, so that they can be differentiated throughout, making it possible to implement the delta rule to adjust both input-hidden and hidden-output layer weights in backpropagation of errors.
The term “bounded” means that the output never reaches very large values, regardless of the input.
Activation functions share a common attribute. Their derivative can be easily computed, which is used during learning to find the slope of the curve. The slope is needed to know in which direction and how much to change the curve to find the optimal values for weights and biases. By using activation functions with easily computed derivatives, we can save some computations.
To summarize:
The problem with the tanh and sigmoid activation functions is that their derivative is near zero in many regions; the slope of their curve gets low for $x$ values far from $0$, which can slow down learning.
Neither ReLU nor Leaky ReLU have this problem for $x > 0$.
When choosing an activation function, ReLU could be a default choice. However, due to the dying ReLU problem, Leaky ReLU might be worth trying as it might improve results. The dead ReLU always outputs same value, which is 0. The ReLU might end up in this state after its weights were updated in such way that it will never activate again. Once it ends up in this state, it is unlikely to recover because the function’s gradient for $0$ is also $0$, so the weights of the neuron will not be changed. The inputs of a dead ReLU are still being updated via other neurons, so the dead ReLU can be revived through updates to the previous layer.
At zero, the ReLU (rectified linear unit) curve has a kink, and in most implementations the derivative there is simply taken as zero.
For a function $f$ to have a derivative it is necessary for the function to be continuous, but continuity alone is not sufficient, as in the ReLU. The derivative of ReLU at zero is undefined. Why? As you approach x=0 from the left, the left-hand derivative is zero; as you approach x=0 from the right, the right-hand derivative is 1. Because of this inconsistency (left and right derivatives not being equal), mathematically the derivative of ReLU at x=0 is not defined. However, in practice and in most implementations it is taken as zero.
Consider
If we have to predict the cost of fitting just an AC to a car, then we can do $$(\$12000+\$300)/2=\$6150$$
But this would mean that equipping a Maruthi with an AC would cost more than the price of the car itself.
Alternatively, we could ask by how much a car’s value will increase relative to its base price, but the above model is unable to answer this question.
Log-scale informs on relative changes (multiplicative), while linear-scale informs on absolute changes (additive). When do you use each?
When you care about relative changes, use the log-scale; when you care about absolute changes, use linear-scale. This is true for distributions, but also for any quantity or changes in quantities.
If we're trying to model something, and the mechanism acts via a relative change, log-scale is critical to capturing the behavior seen in our data. But if the underlying model's mechanism is additive, we'll want to use linear-scale.
Stock value of Qualcomm on day 1: $ \$100 $
Stock value of Qualcomm on day 2, $ \$101 $
We report this change in two ways!
Stock value of AMD goes from $ \$1 $ to $ \$1.10 $
Stock value of Qualcomm goes from $ \$100 $ to $ \$110 $
Based on what change will you invest your money?
If we convert to log space,
Stock value of AMD goes from $$\log_{10}(\$1) \;to\; \log_{10}(\$1.10) = 0 \;to\; 0.0413$$
Stock value of Qualcomm goes from $$\log_{10}(\$100) \;to\; \log_{10}(\$110) = 2 \;to\; 2.0413$$
Now, taking the absolute difference in log space, we find that both changed by 0.0413.
For two stocks whose mean value is different but whose relative change is identically distributed (they have the same distribution of daily percent changes), their log distributions will be identical in shape just shifted. Conversely, their linear distributions will not be identical in shape, with the higher valued distribution having a higher variance.
If we were to look at these same distributions in linear, or absolute space, we would think that higher-valued share prices correspond to greater fluctuations.
For our investing purposes though, where only relative gains matter, this is not necessarily true as shown below.
Yesterday, AMD was at $\$1$ per share and Qualcomm at $\$100$ per share.
Today they both went up by one dollar, to $\$2$ and $\$101$ respectively.
Which company would you put your $\$100$?
If you had invested $\$100$ yesterday, you'd have $\$200$ with AMD, or $\$101$ with Qualcomm. So here we "care" about the relative gains. If we convert to log space,
Stock value of AMD goes from $$\log_{10}(\$1) \;to\; \log_{10}(\$2) = 0 \;to\; 0.301$$
Stock value of Qualcomm goes from $$\log_{10}(\$100) \;to\; \log_{10}(\$101) = 2 \;to\; 2.0043$$
Hence AMD has the highest relative gain as shown by the absolute log differences.
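The arithmetic above can be reproduced in a couple of lines (prices are the hypothetical ones from this example):
In [ ]:
import numpy as np

amd_old, amd_new = 1.0, 2.0
qcom_old, qcom_new = 100.0, 101.0

# absolute change: identical for both stocks
print(amd_new - amd_old, qcom_new - qcom_old)     # 1.0  1.0

# log-space (relative) change: very different
print(np.log10(amd_new) - np.log10(amd_old))      # ~0.301
print(np.log10(qcom_new) - np.log10(qcom_old))    # ~0.0043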
Probability
Any interval (a,b) contains a continuum of real numbers, which is why you can zoom in on an interval forever and there will still be an interval there. Calculus concepts like limits and continuity depend on the mathematics of the continuum.
In discrete mathematics, we are concerned with properties of numbers especially counting.
Probability is the likelihood that something will occur or not.
Rules:
Probability of an Event
If $E$ is an event in a finite, nonempty sample space $S$ of equally likely outcomes, then the probability of the event $E$ is
$$ P(E) = \frac { the\; number\; of\; outcomes\; in\; E}{the\; number\; of\; outcomes\; in\; S }$$Here sample space $S$ is nothing but the total number of outcomes. The hypothesis of equally likely outcomes is critical here.
To put in a simple way.
$$ P(E) = \frac {No. \;of\; possibilities\; that \;meet \;the \;conditions} {No. \;of \;all\; equally \;likely \;possibilities}$$
Multiplication Principle
Suppose an event $A$ has probability $p_1$ and an event $B$ has probability $p_2$ under the assumption that $A$ occurs. Then the probability that both $A$ and $B$ occur is $p_1 p_2$.
If the events $A$ and $B$ are independent, we can omit the phrase “under the assumption that $A$ occurs,” since that assumption would not matter.
We will often be interested in finding probabilities involving multiple events such as
Conditional Probability
The above Multiplication Principle of Probability can be stated succinctly with this notation as follows:
$$P (A \;and\; B) = P(A) \cdot P(B|A)$$Where $P(B|A)$ is read as the probability of $B$ given that event $A$ has already occurred.
Thus the conditional probability formula is given as: If the event B depends on the event A, then
$$P(B|A) = \frac {P (A \;and\; B) } {P(A)}$$When A and B are independent, then $$P(A \;and\; B) = P(A) \cdot P(B)\\$$
As an example, imagine a bag of marbles containing 3 green marbles and 2 red marbles. The probability of picking a green marble the 1st time and another green marble the next time (without replacement) is given by:
$$P(1st\; Green \;and\; 2nd\; Green) = 3/5 \times 2/4 = 3/10 = 0.30 = 30\%\\$$
Expectation
If the outcomes of an experiment are given numerical values (such as the total on a roll of two dice, or the payoff on a lottery ticket), we define the expected value to be the sum of all the numerical values times their respective probabilities.
For example, suppose we roll a fair die. If we roll a multiple of 3, we win $\$3$; otherwise we **lose** $\$1$. We want to decide if this is a reasonable game to play.
To do so we calculate the probabilities of the two possible payoffs, as shown in the table below.
$$\begin{array}{|c|c|} \hline \; Payoff \;& \;Probability\; \\\hline \; +3 \;& \;2/6 \; \\\hline \; -1 \;& \;4/6 \; \\\hline \end{array}$$The expected value is $$ (3) \times (2/6) + (-1) \times (4/6) = 1/3$$
We interpret this to mean that we would win an average of 1/3 dollar per game in the long run.
The general idea of expected value is that we have some function that assigns a number to every member of a sample space. And expected value of the function is just the sum of its values for each member of the sample space weighted by probability.
So for any random variable $X$ defined over sample space $x_1, x_2, ... x_n$:
$$E(X) = p(x_1)X(x_1)+p(x_2)X(x_2)+\cdots+p(x_n)X(x_n) = \sum_{i=1}^n p(x_i) X(x_i)$$
Variance
The usefulness of the expected value as a prediction for the outcome of an experiment is increased when the outcome is not likely to deviate too much from the expected value.
A measure of this deviation, called the variance, is $\sigma^2 = V(X)= E((X - \mu)^2)$, where $X$ is a numerically valued random variable with expected value $\mu = E(X)$.
Standard Deviation
The standard deviation $\sigma = \sqrt{V(X)}$
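For the die game above, the expected value, variance, and standard deviation can be computed directly from the payoff table:
In [ ]:
import numpy as np

values = np.array([3.0, -1.0])   # payoffs
probs = np.array([2/6, 4/6])     # their probabilities

mu = (probs * values).sum()                  # E(X) = 1/3
var = (probs * (values - mu) ** 2).sum()     # V(X) = E((X - mu)^2)
print(mu, var, np.sqrt(var))                 # ~0.333  ~3.556  ~1.886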
Information
Imagine two people Rahul and Bill Gates living in India and USA respectively.
Notice that Rahul’s actions give information about the weather in India.
Bill’s actions give no information.
This is because Rahul’s actions are random and correlated with the weather in India, whereas Bill’s actions are deterministic.
Assume Rahul is playing cricket today, suddenly it starts raining (totally independently), and what we are observing is an unexpected event—it is "surprising".
How can we quantify this notion of unexpected information?
Intuitively, more information is gained from observing an unexpected event—it is "surprising". And measurement of rarer events will yield more information content (which was unknown till that time) than more common values.
For example, if there is a one-in-a-million chance of Rahul winning the lottery, his friend Bill will gain significantly more information from learning that he won than that he lost on a given day.
Entropy
We're interested in assigning a number to an event that characterizes the quantity of information it carries.
It thus satisfies two probabilistic criteria:
(a) Monotonicity - Rarer (less probable) events carry more information content than more common ones
(b) Additivity - The information content of two independent events is the sum of each event's information content $$ I(x_1 x_2) = I(x_1) + I(x_2) $$
A more fruitful observation is this: the knowledge that two independent events happened should be the sum of the knowledge of each one. Think of paying someone for information about two events. If they are independent, then the amount you would pay for the information about both should be the sum of the amounts you'd pay for each.
In mathematics, it can be proved that the only functions that have the above property are logarithmic functions:
$$\log_b(RS)=\log_bR+\log_bS \\$$Thus $$I(x_1)=\log_b \frac{1}{P(x_1)}=-\log_b P(x_1)\\$$
We're interested in assigning a number to an event that characterizes the quantity of information it carries and the above is the number.
Thus, we define entropy as the expected information quantity.
I (Information quantity) is a function that returns a number for each member of a sample space of possible events. We can compute the expected value of I, which we call H:
Thus, H: expected value of I is given by
$$ H(X) = P(x_1) \cdot (-\log_bP(x_1)) + P(x_2) \cdot (-\log_bP(x_2)) +\cdots +P(x_n) \cdot (-\log_bP(x_n))$$$$H(X) = -\sum_{i=1}^n P(x_i) \log_bP(x_i)$$H(X) is called the entropy or expected information value of X.
where:
The entropy measures the expected uncertainty in $X$. We also say that $H(X)$ is approximately equal to how much information we learn on average from one instance of the random variable $X$
Entropy is the weighted average of the negative log probability over possible events (this much reads directly from the equation), which measures the uncertainty inherent in their probability distribution. The higher the entropy, the less certain we are about the value we're going to get.
The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose expected value is the average amount of information, or entropy, generated by this distribution.
Entropy is a measure of unpredictability of information content.
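A tiny numpy helper makes the definition concrete; for example, a fair coin is maximally uncertain, while a heavily biased coin has lower entropy:
In [ ]:
import numpy as np

def entropy(p, base=2):
    # expected value of -log_b(p) under p
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p) / np.log(base)).sum()

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([0.9, 0.1]))    # biased coin: ~0.469 bits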
Cross Entropy
Given a classification problem, you know the true or actual distribution $p$, and you calculate the predicted distribution $q$.
According to the predictions, the information content of every event is going to be $-\log_b q(x_i)$ because this is what we predicted.
But to calculate the information content of the actual or true values, we compute these over $p(x_i)$.
Hence the cross entropy between two probability distributions p and q is defined as $$H(p,q) = -\sum_{i=1}^n p(x_i) \log_b q(x_i) $$
Suppose for a specific training instance, the label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore:
$$ \begin{array}{c|lcr} T & \text{Pr(Class A) } & \text{Pr(Class B) } & \text{Pr(Class C) } \\ \hline 1 & 0.0 & 1.0 & 0.0 \\ \end{array} $$We can interpret the above "true" distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.
Now, suppose your machine learning algorithm predicts the following probability distribution:
$$ \begin{array}{c|lcr} P & \text{Pr(Class A) } & \text{Pr(Class B) } & \text{Pr(Class C) } \\ \hline 1 & 0.228 & 0.619 & 0.153 \\ \end{array} $$As per the above equation p(x) is the true probability, and q(x) the predicted probability. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :
$$H = - (0.0 \cdot \ln(0.228) + 1.0 \cdot \ln(0.619) + 0.0 \cdot \ln(0.153)) = 0.479$$So that is how "wrong" or "far away" your prediction is from the true distribution.
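The 0.479 above can be reproduced directly from the definition; since the true distribution is one-hot, only the true-class term survives:
In [ ]:
import numpy as np

p = np.array([0.0, 1.0, 0.0])          # true (one-hot) distribution
q = np.array([0.228, 0.619, 0.153])    # predicted distribution

print(-(p * np.log(q)).sum())          # ~0.4797
print(-np.log(q[np.argmax(p)]))        # same thing: -log of the true-class probability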
Consider another classification problem with three possible outcomes, such as "democrat", "republican", "independent", where the actual outcome for a training item is the middle value, "republican", i.e. all the probability mass is concentrated on the true label (all the mass is allocated to the one that actually wins).
Computing the cross entropy between this one-hot target and the predicted distribution again tells us how "wrong" or "far away" our prediction is from the true distribution.
Notice that for a classification problem, all the actual probabilities will be 0 except for one probability, which will have value 1. So all the terms in the equation will drop out except one. Also, because of the negative sign, cross entropy error will always be positive (or zero).
Hence the cross entropy loss is a measure of how good the prediction was, i.e. how close the predicted distribution is to the true one.
If the prediction is very close to the true distribution then cross entropy will be low or minimized.
Objective function is a more general term used to represent any function that is used for optimization during training. A loss function is treated as being part of a cost function which itself is a type of an objective function.
This is because the objective of the network is to minimize the cost function.
The loss function is usually defined on a single data point (its prediction and label), for example a squared error, whereas the cost function is the loss function averaged over all training examples.
Distance functions are often used to represent loss or cost functions.
torch.nn also provides a number of loss functions that are naturally important to machine learning applications.
Classification Loss -
Regression Loss -
In PyTorch jargon, loss functions are often called criterions. Criterions are really just simple modules that you can parameterize upon construction and then use as plain functions from there on.
The models in general differ in the type of response variable they predict, i.e. the y.
In each model, the response variable can take on a bunch of different values. In other words, they are random variables and some of these require us to use losses like cross entropy loss.
Classification problems that require, say, a yes or no in the output, like saying whether a ball is present in an image, cannot be expected to have any meaning if the architecture predicts some arbitrary number, say 678. It has to predict some probability, or certainty/uncertainty.
The problem with using the Mean Square Error with Logistic Regression is that the loss or cost surface is flat in some regions.
Also, if we use Mean Square Error and select a bad initialization value for our parameters, the cost or total loss surface will be flat at that point and the parameters will not update correctly.
The advantage of the cross-entropy loss over the Mean Square Error is that there are contours all over the cost or total loss surface, and as a result, the algorithm will converge to a minimum correctly.
If we have very popular classes as well as very 'rare' classes, the log loss generally handles this situation a little bit better (because it uses the log) compared to the Brier score/MSE.
L1Loss creates a criterion that measures the mean absolute error (MAE) between each element in the Input/Prediction x and target y
By default, the losses are averaged or summed over observations for each minibatch.
In [0]:
# L1Loss
import numpy as np
import torch
import torch.nn as nn
predicted = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
print(predicted,"\n")
print(target,"\n")
print("Torch L1Loss unreduced: ",nn.L1Loss(reduction='none')(predicted, target),"\n")
print("Torch L1Loss: ",nn.L1Loss()(predicted, target),"\n")
print(abs(predicted.detach().numpy() - target.numpy()),"\n")
print(abs(predicted.detach().numpy() - target.numpy()).mean())
In [0]:
# MSELoss
print(nn.MSELoss(reduction='none')(predicted, target),"\n")
print(nn.MSELoss()(predicted, target),"\n")
print((predicted.detach().numpy() - target.numpy())**2,"\n")
print(((predicted.detach().numpy() - target.numpy())**2).mean())
PyTorch doesn't compute the cross entropy directly as per
$$H(p,q) = -\sum_{i=1}^n p(x_i) \log_b q(x_i) $$Say we are given predicted scores x = [0, 0, 0, 1] and the target class at index [3].
PyTorch requires them to be first converted into probabilities for which it uses the softmax function.
Hence it converts to $$softmax(x) = softmax[0,0,0,1] = [0.1749,0.1749,0.1749,0.4754] $$
$$H(p,q) = -1 \cdot \log(0.4754) = 0.7437$$Logits here is understood to mean that the function operates on the unscaled output of earlier layers and that the relative scale to understand the units is linear. It means, in particular, that the sum of the inputs may not equal 1 and that the values are not probabilities (you might have an input of 5).
The softmax "squishes" the inputs so that sum(input) = 1: it's a way of normalizing. The shape of output of a softmax is the same as the input: it just normalizes the values. The outputs of softmax can be interpreted as probabilities.
The softmax function, interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
Cross-entropy loss can operate on a row-wise basis.
This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class. It is useful when training a classification problem with C classes.
If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
Weight argument: we use it if we want to make the classification of one of the classes stronger. Say we want to classify the zeros very accurately, whereas we can tolerate some error between the other classes: if a number written as 7 wrongly gets classified as 2 it is not so erroneous, but if the same 7 is wrongly classified as zero then it is highly erroneous.
In digit classification all classes are of course equally important, so we tend to put equal weight on each of these classes.
In cancer detection, however, the classes may not be equally valued and some classes might have more importance than others. In such cases we make use of weights.
The input or prediction is expected to contain scores for each class.
This criterion expects a class index (0 to C-1) as the target for each value of a 1D tensor of size minibatch.
The losses are averaged across observations for each minibatch.
In case of K-dimensional loss (say 2d-Images)
where $N = minibatch$
Internally PyTorch uses the natural logarithm and computes the loss directly from the logits with the equivalent formula $loss(x, class) = \log\left(\sum_j e^{x_j}\right) - x_{class}$.
Softmax is not a loss function, nor is it really an activation function. It has a very specific task: It is used for multi-class classification to normalize the scores for the given classes. By doing so we get probabilities for each class that sum up to 1.
Softmax is combined with Cross-Entropy-Loss to calculate the loss of a model. Unfortunately, because this combination is so common, it is often abbreviated. Some are using the term Softmax-Loss, whereas PyTorch calls it only Cross-Entropy-Loss.
The target type is torch.LongTensor containing class indices; conceptually the label is a one-hot encoding, i.e. only one position is 1 and all other positions are 0. We do not use this criterion for multi-label scenarios.
$C$ is the number of classes. It is also the dimension of the optional weight vector that assigns a weight to each label.
In [0]:
# Single class prediction
# Predicted output = [0.8982 0.805 0.6393 0.9983 0.5731 0.0469 0.556 0.1476 0.8404 0.5544]
# Assume we have 10 classes, from 0, 1, to 9. For the above class predictions only one class is correct.
# Target class: [1]
# 1x10 Tensor: e.g. a prediction (score) for each class.
# Then:
# loss = math.log(sum(np.exp(output))) - output[target[0]] = 2.948818 - 0.805 = 2.143818
import numpy as np
import torch
import torch.nn as nn
import math
# Predicted class scores (logits)
predicted = torch.tensor([0.8982,0.805,0.6393,0.9983,0.5731,0.0469,0.556,0.1476,0.8404,0.5544]).view(1,-1)
print(predicted)
# The actual labels are given as - Only one label is given that corresponds to element at index [1]
target = torch.LongTensor([1])
print(target)
xi = predicted.detach().numpy()
yt = target.numpy()
# Torch Method
print("Torch Cross Entropy Loss with no reduction: ",nn.CrossEntropyLoss(reduction='none')(predicted,target),"\n")
print("Torch Cross Entropy Loss: ",nn.CrossEntropyLoss()(predicted,target),"\n")
# Manual Method-1
predicted = [0.8982,0.805,0.6393,0.9983,0.5731,0.0469,0.556,0.1476,0.8404,0.5544]
target = [1]
print("Manual Loss: ",math.log(sum(np.exp(predicted))) - predicted[target[0]])
# Manual Method-2
lstc = []
for k in range(len(xi)):
lstc.append(-np.log(np.exp(xi[k][yt[k]]) / np.exp(xi[k]).sum()))
print(lstc, ": ",np.mean(lstc))
In [0]:
# Two predictions: assume you have a minibatch of size 2, i.e. 2 samples each with 10 class scores
# Each sample has a list of class scores
# prediction should be (batch_size, n_label) and target should be (batch_size) with values in [0, n_label-1].
import numpy as np
import torch
import torch.nn as nn
import math
# Predicted class scores - 2x10 Tensor: two predictions at one time
predicted = torch.tensor([[0.8982,0.805,0.6393,0.9983,0.5731,0.0469,0.556,0.1476,0.8404,0.5544],
[0.9457,0.0195,0.9846,0.3231,0.1605,0.3143,0.9508,0.2762,0.7276,0.4332]])
print(predicted)
# Actual prediction is at index[1] and index[5]
target = torch.LongTensor([1,5])
print(target)
# Then:
# loss1 = math.log(sum(np.exp(predicted[0]))) - predicted[0][target[0]] = 2.948818 - 0.805 = 2.143818
# loss2 = math.log(sum(np.exp(predicted[1]))) - predicted[1][target[1]] = 2.874357 - 0.3143 = 2.560057
# loss = (loss1 + loss2)/2 = 2.351938
print("Torch Cross Entropy Loss with no reduction: ",nn.CrossEntropyLoss(reduction='none')(predicted,target),"\n")
print("Torch Cross Entropy Loss: ",nn.CrossEntropyLoss()(predicted,target),"\n")
In [0]:
# CTC - The Connectionist Temporal Classification loss
log_probs = torch.randn(50, 16, 20).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, 20, (16, 30), dtype=torch.long)
input_lengths = torch.full((16,), 50, dtype=torch.long)
target_lengths = torch.randint(10,30,(16,), dtype=torch.long)
print(log_probs, "\n", targets,"\n", input_lengths,"\n", target_lengths,"\n")
loss = nn.CTCLoss()(log_probs, targets, input_lengths, target_lengths)
print(loss)
Negative log likelihood loss (NLLLoss) is useful to train a classification problem with C classes.
If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set. Weight factor associated with loss of a class to give importance for classes of unequal importance.
The input given through a forward call is expected to contain log-probabilities of each class.
Obtaining log-probabilities in a neural network is easily achieved by adding a LogSoftmax layer in the last layer of your network. You may use CrossEntropyLoss instead, if you prefer not to add an extra layer.
The target that this loss expects is a class index (0 to C-1, where C = number of classes)
The softmax activation function is often placed at the output layer of a neural network. It’s commonly used in multi-class learning problems where a set of features can be related to one-of-K classes.
Intuitively, what the softmax does is that it squashes a vector of size K between 0 and 1. Furthermore, because it is a normalization of the exponential, the sum of this whole vector equates to 1.
We can then interpret the output of the softmax as the probabilities that a certain set of features belongs to a certain class.
Thus, given a three-class example below, the scores yi are computed from the forward propagation of the network. We then take the softmax and obtain the probabilities as shown.
The output of the softmax describes the probability (or if you may, the confidence) of the neural network that a particular sample belongs to a certain class. Thus, for the first example above, the neural network assigns a confidence of 0.71 that it is a cat, 0.26 that it is a dog, and 0.04 that it is a horse. The same goes for each of the samples above.
We can then see that one advantage of using the softmax at the output layer is that it improves the interpretability of the neural network. By looking at the softmax output in terms of the network’s confidence, we can then reason about the behavior of our model.
In practice, the softmax function is used in tandem with the negative log-likelihood (NLL) which is summed for all the correct classes. We can interpret the loss as the “unhappiness” of the network with respect to its parameters. The higher the loss, the higher the unhappiness.
The negative log-likelihood becomes unhappy at smaller values, where it can reach infinite unhappiness (that’s too sad), and becomes less unhappy at larger values. Because we are summing the loss function to all the correct classes, what’s actually happening is that whenever the network assigns high confidence at the correct class, the unhappiness is low, but when the network assigns low confidence at the correct class, the unhappiness is high.
In [0]:
# NLLLoss
import numpy as np
import torch
import torch.nn as nn
# input is of size Batches x Class Probability (N x C = 3 x 3)
predicted = torch.tensor([[5.0,4.0,2.0],[4.0,2.0,8.0],[4.0,4.0,1.0]])
# Actual probabilites are at the below indexes. Each element in target has to have 0 <= value < C
target = torch.tensor([0, 2, 1])
print("Tensor: ",predicted,"\n")
# Top to bottom is dim0 and left to right is dim1
print("Softmax: ",nn.Softmax(dim=1)(predicted),"\n")
print("Log(Softmax): ",nn.LogSoftmax(dim=1)(predicted),"\n")
# NLLLoss at index of correct class probabilities
result = nn.NLLLoss(reduction='none')(nn.LogSoftmax(dim=1)(predicted), target)
print("NLLLoss(Log(Softmax)) with no reduction: ",result,"\n")
result = nn.NLLLoss()(nn.LogSoftmax(dim=1)(predicted), target)
print("NLLLoss(Log(Softmax)): ",result,"\n")
# 2D loss example (used, for example, with image inputs)
N, C = 5, 4
# input is of size N x C x height x width
data = torch.randn(N, 16, 10, 10)
conv = nn.Conv2d(16, C, (3, 3))
# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
result1 = nn.NLLLoss()(nn.LogSoftmax(dim=1)(conv(data)), target)
#print(result1,"\n")
In [0]:
# LogSoftmax
x = torch.tensor([[5.0,4.0,2.0],[4.0,2.0,8.0],[4.0,4.0,1.0]])
x = x.numpy()
listl = []
for k in range(len(x)):
    listl.append(np.log(np.exp(x[k]) / np.exp(x[k]).sum()))
print("Manual log(Softmax): ",listl,"\n")
# NLLLoss
x0 = torch.tensor([[5.0,4.0,2.0],[4.0,2.0,8.0],[4.0,4.0,1.0]])
x = nn.LogSoftmax(dim=1)(x0)
y = torch.tensor([0, 2, 1])
x = x.numpy()
y = y.numpy()
lst = []
for k in range(len(x)):
    lst.append(-x[k][y[k]])
print("Manual NLLLoss(log(softmax)): ",lst,"\n")
print("Manual NLLLoss(log(softmax)): ",np.mean(lst),"\n")
PoissonNLLLoss is Negative log likelihood loss with Poisson distribution of target.
For a given target (Random Variable) in a Poisson distribution, the function calculates the Negative Log likelihood loss.
The loss can be described as: $$\text{target} \sim \mathrm{Poisson}(\text{input}), \qquad \text{loss}(\text{input}, \text{target}) = \text{input} - \text{target}\cdot\log(\text{input}) + \log(\text{target}!)$$
The last term can be omitted or approximated with Stirling's formula. The approximation is used for target values greater than 1; for targets less than or equal to 1, zeros are added to the loss.
In [0]:
log_input = torch.randn(5, 2, requires_grad=True)
target = torch.randn(5, 2)
print("Log input: ",log_input,"\n")
print("Target: ",target,"\n")
result = nn.PoissonNLLLoss(full=True)(log_input, target)  # full=True adds the Stirling term, matching the manual check below
print(result)
In [0]:
x = log_input.detach().numpy()
y = target.numpy()
print("Log input: ",x,"\n")
print("Target: ",y,"\n")
# Stirling approximation: target*log(target) - target + 0.5*log(2*pi*target)
def stirling_approx(y):
    return y*np.log(y) - y + 0.5*np.log(2*np.pi*y)
lst = []
for k in range(len(x)):
    lsti = []
    for i in range(len(x[k])):
        # with log_input=True the per-element loss is exp(input) - target*input (+ Stirling term for target > 1)
        lss = np.exp(x[k,i]) - y[k,i]*x[k,i] + (stirling_approx(y[k,i]) if y[k,i] > 1 else 0)
        lsti.append(lss)
    lst.append(lsti)
print(np.array(lst))
print(np.mean(lst))
KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.
The KL divergence is simply the difference between cross entropy and entropy. It is a measure of dissimilarity between two distributions.
This means that, the closer p(y) gets to q(y), the lower the divergence and, consequently, the cross-entropy, will be.
So a classifier looks for good (best) value of p(y) which is the one that minimizes the cross-entropy.
As with NLLLoss, the input given is expected to contain log-probabilities. However, unlike NLLLoss, input is not restricted to a 2D Tensor. The targets are given as probabilities (i.e. without taking the logarithm).
This criterion expects a target Tensor of the same size as the input Tensor.
In [0]:
predicted = torch.rand(2, 3)  # in practice this should contain log-probabilities (e.g. LogSoftmax output)
target = torch.rand(2, 3)
print("Predicted: ",predicted,"\n")
print("Target: ",target,"\n")
print("KLDivLoss no reduction: ",nn.KLDivLoss(reduction='none')(predicted, target),"\n")
print("KLDivLoss: ",nn.KLDivLoss()(predicted, target),"\n")
x = predicted.numpy()
y = target.numpy()
lst = []
for i in range(len(x)):
    lsti = []
    for j in range(len(x[i])):
        # x[i][j] is treated as a log-probability
        lsti.append(y[i][j] * (np.log(y[i][j]) - x[i][j]))
    lst.append(lsti)
print("",np.array(lst),"\n")
print("",np.mean(lst),"\n")
BCELoss creates a criterion that measures the Binary Cross Entropy between the target and the output.
Target (either 0 or 1): the {0,1} binary label of the target class. In a one-hot target vector, only the entry for the sample's class is 1 and all other entries are 0.
Prediction: [0,1] Classification probability score from the neural network. Need to use sigmoid to make sure the range is between 0 and 1.
Cross-entropy error works with two or more values that sum to 1.0 (a probability distribution), so you can't directly use the general CE error if you are doing binary classification with a single output node. The general cross-entropy is defined for a discrete distribution, i.e. it takes a vector of two or more values that form a probability distribution.
Suppose we have a binary prediction problem, for example, we want to predict if a person is Male (class = 0) or Female (class = 1). The most common approach is to spit out a single value that represents the probability of the class encoded as 1. For example, if the predicted output is 0.65 then because that value is greater than 0.5, the prediction is Female.
Thus since the above is a two-category problem, as there are only positive and negative examples, and the probability sum of the two is 1, then only one probability needs to be predicted.
BCELoss requires an output to be piped into a sigmoid function before going to BCE. Sigmoid works here very well as it produces an output between 0 and 1, otherwise if output values would incorporate 0 then BCE could not be computed as log 0 is not defined.
Softmax is a generalised sigmoid activation function for K outputs, hence it is used for the general cross-entropy loss. It is needed because the individual outputs would not sum to 1, so we have to normalise them, and that's what softmax does for us.
BCELoss is used for measuring the error of a reconstruction in for example an auto-encoder. Note that the targets y should be numbers between 0 and 1.
Unlike the softmax loss it is independent for each vector component (class), meaning that the loss computed for every vector component is not affected by other component values. That's why it is used for multi-label classification, where the insight of an element belonging to a certain class should not influence the decision for another class.
In [0]:
# Sigmoid
x = torch.randn(2, 4)
y = nn.Sigmoid()(x)
xnp = x.numpy()
print("Predicted: ",x)
print("Target: ",y)
print("Sigmoid (Predicted): ",1 / (1 + np.exp(-xnp)),'\n')
# Single label - notice each sample has a single binary target
x0 = torch.randn(3)
x = nn.Sigmoid()(x0)
y = torch.FloatTensor(3).random_(2)
print("Predicted: ",x0)
print("Sigmoid (Predicted): ",x)
print("Target: ",y)
print("BCELoss without none: ",nn.BCELoss(reduction='none')(x, y))
print("BCELoss",nn.BCELoss()(x, y),'\n')
x = x.numpy()
y = y.numpy()
lst = []
for i in range(len(x)):
    lst.append(-np.log(x[i]) if y[i]==1 else -np.log(1-x[i]))
print(lst, np.mean(lst),'\n')
# Equivalently
lst = []
for i in range(len(x)):
    lst.append(-np.log(x[i])*y[i] + -np.log(1-x[i])*(1-y[i]))
print(lst, np.mean(lst),'\n')
In [0]:
# Multilabel - notice y has multiple labels per sample, each of them a binary target
import numpy as np
import torch
import torch.nn as nn
x0 = torch.randn(3, 2)
x = nn.Sigmoid()(x0)
y = torch.FloatTensor(3, 2).random_(2)
print("Predicted: ",x0)
print("Sigmoid (Predicted): ",x)
print("Target: ",y)
print("BCELoss without none: ",nn.BCELoss(reduction='none')(x, y))
print("BCELoss",nn.BCELoss()(x, y),'\n')
x = x.numpy()
y = y.numpy()
lst = []
for i in range(len(x)):
    lsti = []
    for j in range(len(x[i])):
        lsti.append(-np.log(x[i][j]) if y[i][j]==1 else -np.log(1-x[i][j]))
    lst.append(lsti)
print(np.array(lst))
print(np.mean(lst),"\n")
# Equivalently
lst = []
for i in range(len(x)):
    lst.append(-np.log(x[i])*y[i] + -np.log(1-x[i])*(1-y[i]))
print(np.array(lst))
print(np.mean(lst))
BCEWithLogitsLoss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
It just simply adds a sigmoid in front of BCELoss above.
This is used for measuring the error of a reconstruction in for example an auto-encoder.
Note that the targets should be numbers between 0 and 1.
It’s possible to trade off recall and precision by adding weights to positive examples.
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to $\frac{300}{100}=3$. The loss would act as if the dataset contains $3\times 100=300$ positive examples.
If we are doing say image segmentation with PixelWise, we could just use CrossEntropyLoss over the output channel dimension.
BCEWithLogitsLoss also handles soft labels, i.e. instead of {dog at (1, 1), cat at (4, 20)} the target can be {dog with strength 0.3 at (1, 1), …}.
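A minimal sketch of the pos_weight argument mentioned above (the shapes and the 3:1 imbalance are assumptions chosen only for illustration):
In [0]:
import torch
import torch.nn as nn
# hypothetical case: 3 labels, each with roughly 3x more negatives than positives
pos_weight = torch.tensor([3.0, 3.0, 3.0])    # one weight per label
logits = torch.randn(4, 3)                    # raw scores, sigmoid is applied internally
targets = torch.FloatTensor(4, 3).random_(2)  # 0/1 targets per label
print(nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets))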
In [0]:
# Single label
import numpy as np
import torch
import torch.nn as nn
x = torch.randn(3)
xs = nn.Sigmoid()(x)
y = torch.FloatTensor(3).random_(2)
print("x: ",x)
print("xs: ",xs)
print("y: ",y)
print(nn.BCELoss()(xs, y))
print(nn.BCEWithLogitsLoss()(x, y))
In [0]:
# Multilabel
x = torch.randn(3, 2)
xs = nn.Sigmoid()(x)
y = torch.FloatTensor(3, 2).random_(2)
print(nn.BCELoss()(xs, y))
print(nn.BCEWithLogitsLoss()(x, y))
MarginRankingLoss creates a criterion that measures the loss given inputs x1, x2, which are two 1D mini-batch Tensors, and a label 1D mini-batch tensor y with values 1 or -1: $Loss(x1, x2, y)$
Element-wise, the inputs and the label are scalars. The margin parameter requires the two inputs to differ by at least the margin in the required direction; otherwise the loss is positive. The default margin is zero.
Thus if y == 1 it is assumed the first input should be ranked higher (have a larger value) than the second input, and vice-versa for y == -1.
In [0]:
x1 = torch.randn(3)
x2 = torch.randn(3)
print("x1: ",x1)
print("x2: ",x2)
y = torch.FloatTensor(np.random.choice([1, -1], 3))
print("y: ",y)
print(nn.MarginRankingLoss(reduction='none',margin=0.1)(x1, x2, y))
print(nn.MarginRankingLoss(margin=0.1)(x1, x2, y))
x1 = x1.numpy()
x2 = x2.numpy()
y = y.numpy()
margin=0.1
lst = []
for i in range(len(x1)):
    lst.append(max(0, -y[i]*(x1[i]-x2[i]) + margin))
print(lst)
print(np.mean(lst))
HingeEmbeddingLoss measures the loss given an input tensor x and a labels tensor y containing values (1 or -1). This is usually used for measuring whether two inputs are similar or dissimilar, e.g. using the L1 pairwise distance as x, and is typically used for learning nonlinear embeddings or semi-supervised learning.
Normal HingeLoss is often used in SVMs.
In [0]:
x = torch.randn(2, 3)
y = torch.FloatTensor(np.random.choice([-1, 1], (2, 3)))
print("x: ",x)
print("y: ",y)
print(nn.HingeEmbeddingLoss(reduction='none',margin=1)(x, y))
print(nn.HingeEmbeddingLoss(margin=1)(x, y))
x = x.numpy()
y = y.numpy()
margin=1
lst=[]
for i in range(len(x)):
    lsti = []
    for j in range(len(x[i])):
        if y[i][j]==1:
            lsti.append(x[i][j])
        else:
            lsti.append(max(0, margin-x[i][j]))
    lst.append(lsti)
print(np.array(lst))
print(np.mean(lst))
MultiLabelMarginLoss creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input x (a 2D mini-batch Tensor) and output y (which is a 2D Tensor of target class indices)
The criterion only considers a contiguous block of non-negative targets that starts at the front.
If y = [5, 3, -1, -1, 4] then the sample is considered to belong to classes 5 and 3, and 4 is ignored because it comes after the first negative entry.
This allows for different samples to have variable amounts of target classes
Given x = [0.1, 0.2, 0.4, 0.8] and y = [3, 0, -1, 1] as in the cell below:
Here classes 3 and 0 are the correct ones; classes 1 and 2 are not.
In [0]:
import torch
loss = torch.nn.MultiLabelMarginLoss()
x = torch.FloatTensor([[0.1, 0.2, 0.4, 0.8]])
y = torch.LongTensor([[3, 0, -1, 1]])
print(loss(x, y)) # will give 0.8500
In [0]:
# One-sample example
x = torch.randn(1, 4)
y = torch.LongTensor(1, 4).random_(-1, 4)
print("x: ",x)
print("y: ",y)
print(nn.MultiLabelMarginLoss(reduction='none')(x, y))
print(nn.MultiLabelMarginLoss()(x, y))
x = x.numpy()
y = y.numpy()
lst = []
for k in range(len(x)):
    sm = 0
    js = []
    for j in range(len(y[k])):
        if y[k][j] < 0: break
        js.append(y[k][j])
    for i in range(len(x[k])):
        for j in js:
            if i not in js:
                sm += max(0, 1-(x[k][j] - x[k][i]))
    lst.append(sm/len(x[k]))
print(lst)
print(np.mean(lst))
In [0]:
# Multi-sample example
x = torch.randn(3, 4)
y = torch.LongTensor(3, 4).random_(-1, 4)
print("x: ",x)
print("y: ",y)
print(nn.MultiLabelMarginLoss(reduction='none')(x, y))
print(nn.MultiLabelMarginLoss()(x, y))
x = x.numpy()
y = y.numpy()
lst = []
for k in range(len(x)):
    sm = 0
    js = []
    for j in range(len(y[k])):
        if y[k][j] < 0: break
        js.append(y[k][j])
    for i in range(len(x[k])):
        for j in js:
            if i not in js:
                sm += max(0, 1-(x[k][j] - x[k][i]))
    lst.append(sm/len(x[k]))
print(lst)
print(np.mean(lst))
SmoothL1Loss creates a criterion that uses a squared term if the absolute element-wise error falls below 1 and an L1 term otherwise. It is less sensitive to outliers than the MSELoss and in some cases prevents exploding gradients (e.g. see “Fast R-CNN” paper by Ross Girshick). Also known as the Huber loss.
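For reference, the element-wise definition used in the manual check below (with the default threshold of 1) is: $$\text{loss}(x_i, y_i) = \begin{cases} 0.5\,(x_i - y_i)^2 & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5 & \text{otherwise} \end{cases}$$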
In [0]:
x = torch.randn(2, 3)
y = torch.randn(2, 3)
print("x: ",x)
print("y: ",y)
print(nn.SmoothL1Loss()(x, y))
print(nn.SmoothL1Loss(reduction='none')(x, y))
x = x.numpy()
y = y.numpy()
def smoothl1loss(x, y):
    if abs(x-y) < 1: return 1/2*(x-y)**2
    else: return abs(x-y) - 1/2
lst = []
for i in range(len(x)):
    lsti = []
    for j in range(len(x[i])):
        lsti.append(smoothl1loss(x[i][j], y[i][j]))
    lst.append(lsti)
print(np.array(lst))
print(np.mean(lst))
SoftMarginLoss creates a criterion that optimizes a two-class classification logistic loss between input tensor x and target tensor y (containing 1 or -1).
It treats the problem as a multi-label two-class (binary) problem: the N elements are N independent binary problems, and their per-element logistic losses are added together and averaged.
Typically a margin loss is used when the output of the neural network lies in the range (-1, 1). Whichever class is present is associated with +1 and whichever class is absent is associated with -1.
The target is either -1 or 1, i.e. {-1, 1}, while the prediction can lie anywhere in (-1, 1).
Here the nonlinearity has to be tanh or a similar function that generates predictions in the range (-1, 1).
The default margin criterion is 1 and is basically the amount of tolerance allowed. It is typically used when the nonlinearities are bounded in (-1, 1).
For signed data such as synthetic aperture radar images, ultrasound images, or MR signals, the normalized inputs naturally fall in the range (-1, 1).
In the soft-margin loss we replace the hard margin with a log function so that we get a continuous, smoothly scaled loss.
The hard margin loss has a discontinuity in its derivative, which the soft version avoids.
In [0]:
x = torch.randn(2, 4)
y = torch.FloatTensor(np.random.choice([-1, 1], (2, 4)))
print("x: ",x)
print("y: ",y)
print(nn.SoftMarginLoss(reduction='none')(x, y))
print(nn.SoftMarginLoss()(x, y))
x = x.numpy()
y = y.numpy()
lst = []
for k in range(len(x)):
    sm = 0
    for i in range(len(x[k])):
        sm += np.log(1 + np.exp(-y[k][i]*x[k][i]))
    lst.append(sm/len(x[k]))
print(lst, np.mean(lst))
MultiLabelSoftMarginLoss creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input x and target y of size (N, C). $y$ can only take 1, 0, representing positive and negative classes.
In [0]:
x = torch.randn(2, 4)
y = torch.FloatTensor(2, 4).random_(2)
print("x: ",x)
print("y: ",y)
print(nn.MultiLabelSoftMarginLoss(reduction='none')(x, y))
print(nn.MultiLabelSoftMarginLoss()(x, y))
x = x.numpy()
y = y.numpy()
lst = []
for k in range(len(x)):
    sm = 0
    for i in range(len(x[k])):
        sm -= y[k, i]*np.log(np.exp(x[k, i])/(1+np.exp(x[k, i]))) + \
              (1-y[k, i])*np.log(1/(1+np.exp(x[k, i])))
    lst.append(sm/len(x[k]))
print(lst)
print(np.mean(lst))
CosineEmbeddingLoss creates a criterion that measures the loss given input tensors $x_1$, $x_2$ and a Tensor label $y$ with values $1$ or $-1$. This is used for measuring whether two inputs are similar or dissimilar, using the cosine distance, and is typically used for learning nonlinear embeddings or semi-supervised learning.
The cosine-similarity loss is intended to make the two vectors of a similar pair as close as possible. Note that gradients flow into both input vectors.
The margin can take values in [-1, 1], but values between 0 and 0.5 are recommended.
In [0]:
x1 = torch.randn(2, 3)
x2 = torch.randn(2, 3)
y = torch.FloatTensor(np.random.choice([1, -1], 2))
print("x1: ",x1)
print("x2: ",x2)
print("y: ",y)
print(nn.CosineEmbeddingLoss(reduction='none',margin=0.1)(x1, x2, y))
print(nn.CosineEmbeddingLoss(margin=0.1)(x1, x2, y))
x1 = x1.numpy()
x2 = x2.numpy()
y = y.numpy()
margin=0.1
from scipy.spatial.distance import cosine
def cos(x, y): return 1-cosine(x, y)
lst = []
for k in range(len(x1)):
    if y[k] == 1: lst.append(1-cos(x1[k], x2[k]))
    elif y[k] == -1: lst.append(max(0, cos(x1[k], x2[k])-margin))
print(lst)
print(np.mean(lst))
MultiMarginLoss creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input x (a 2D mini-batch Tensor) and output y (which is a 1D tensor of target class indices, $0 \leq y \leq \text{x.size}(1)-1$)
Optionally, you can give non-equal weighting on the classes by passing a 1D weight tensor into the constructor.
In [0]:
x = torch.randn(2, 4)
y = torch.LongTensor(2).random_(4)
print("x: ",x)
print("y: ",y)
print(nn.MultiMarginLoss(reduction='none',margin=0.9, p=2)(x, y))
print(nn.MultiMarginLoss(margin=0.9, p=2)(x, y))
x = x.numpy()
y = y.numpy()
p=2
margin=0.9
lst = []
for k in range(len(x)):
    sm = 0
    for i in range(len(x[k])):
        if i != y[k]:
            sm += max(0, (margin - x[k, y[k]] + x[k, i])**p)
    lst.append(sm/len(x[k]))
print(lst)
print(np.mean(lst))
TripletMarginLoss creates a criterion that measures the triplet loss given input tensors $x_1$, $x_2$, $x_3$ and a margin with a value greater than 0. This is used for measuring a relative similarity between samples. A triplet is composed of a, p and n: an anchor, a positive example and a negative example, respectively. The shapes of all input tensors should be (N, D).
In [0]:
x1 = torch.randn(2, 3)
x2 = torch.randn(2, 3)
x3 = torch.randn(2, 3)
margin = 0.9
p = 2
print("x1: ",x1)
print("x2: ",x2)
print("x3: ",x3)
print(nn.TripletMarginLoss(reduction='none',margin=margin, p=p)(x1, x2, x3))
print(nn.TripletMarginLoss(margin=margin, p=p)(x1, x2, x3))
x1 = x1.numpy()
x2 = x2.numpy()
x3 = x3.numpy()
def d(x1, x2, p):
    return sum(abs(x1-x2)**p)**(1/p)   # p-norm distance (abs keeps it valid for any p)
lst = []
for k in range(len(x1)):
    # one hinge term per sample: max(d(anchor, positive) - d(anchor, negative) + margin, 0)
    lst.append(max(d(x1[k], x2[k], p) - d(x1[k], x3[k], p) + margin, 0))
print(lst)
print(np.mean(lst))
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
Consider the mean square error:
$$MSE = \frac{1}{T} \sum_{i=1}^{T} (y_i - wx_i)^2$$
Notice that the mean square error (cost) is the mean of the individual losses computed over all of the given data points.
Thus the derivative is also computed over all the given data points (the sum of all the per-sample derivatives):
$$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} -2x_i\left(y_i-wx_i\right)$$
And most importantly, the weight update is done from the summed value of all the derivatives:
$$w = w - \eta\cdot\frac{dL}{dw} = w - \eta\cdot\frac{1}{T} \sum_{i=1}^{T} -2x_i\left(y_i-wx_i\right)$$
The above says that, to perform gradient descent, we need to calculate the gradient of the cost function, i.e. the cumulative sum of the loss gradient per sample.
If we have 3 million samples, we have to loop through them 3 million times (or use a dot product) just for one tiny update of the weight.
In [0]:
# Accumulates gradients over the entire data set but makes only one update per epoch
loop maxEpochs times
    # Begin Inner Loop
    for-each data item
        compute a gradient for each weight and bias
        accumulate gradient
    end-for
    # End Inner Loop
    use accumulated gradients to update weight and bias
end-loop
In [0]:
def batch_gradient_descent():
    w, b, lr, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)   # grad_b: gradient w.r.t. b, analogous to grad_w
        # one update per pass over the whole dataset
        w = w - dw * lr
        b = b - db * lr
It uses the whole dataset to calculate the gradient of the loss function. The descent can be very slow, because only one update is performed for the whole dataset. Also, the whole dataset needs to fit in memory, which can be a problem especially for very large datasets. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
In batch training, an accumulated gradient for each weight and bias is computed using all training data items, and then weights and biases are updated.
However in stochastic training (also called online training), after the gradients are computed for a single training item, weights and biases are updated immediately. Put another way, in stochastic training, the accumulated gradient values of batch training are estimated using single-training data items.
Notice that the algorithm updates the parameters for every single data point.
Also notice that we are missing the summation update as we are just doing point estimation. Thus SGD is only an approximate gradient and thus no guarantee that each step will decrease the loss.
A parameter update which is locally favorable to one point may harm other points as each point is trying to push the parameters in a direction most favourable to it.
In [0]:
loop maxEpochs times
    # Begin Inner Loop
    for-each data item
        compute gradients for each weight and bias
        use gradients to update each weight and bias
    end-for
    # End Inner Loop
end-loop
In [0]:
def stochastic_gradient_descent():
    w, b, lr, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        shuffle(X, Y)  # we shuffle the training data at every epoch
        for x, y in zip(X, Y):
            dw = grad_w(w, b, x, y)
            db = grad_b(w, b, x, y)   # grad_b assumed analogous to grad_w
            # update immediately after each single sample
            w = w - dw * lr
            b = b - db * lr
In [0]:
# STOCHASTIC GRADIENT DESCENT
for epoch in range(4):
    for x, y in zip(X, Y):   # X, Y are the training inputs and targets
        yhat = forward(x)
        loss = criteria(yhat, y)
        loss.backward()
        w.data = w.data - lr * w.grad.data
        b.data = b.data - lr * b.grad.data
        w.grad.data.zero_()   # gradients must be zeroed in-place with zero_()
        b.grad.data.zero_()
# STOCHASTIC GRADIENT DESCENT with trainloader
for epoch in range(4):
    for x, y in trainloader:   # the DataLoader yields (input, target) pairs
        yhat = forward(x)
        loss = criteria(yhat, y)
        loss.backward()
        w.data = w.data - lr * w.grad.data
        b.data = b.data - lr * b.grad.data
        w.grad.data.zero_()
        b.grad.data.zero_()
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily (noisy approximations).
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's large fluctuations (oscillations), on the one hand, enable it to jump to new and potentially better local minima.
On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.
However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
Note that we shuffle the training data at every epoch, and the learning rate should be chosen carefully: with too large a learning rate the weights cannot converge, while with too small a learning rate the number of iterations required to converge becomes very high.
In SGD only one point at a time tells us which direction to go in, and that causes oscillations, unlike batch GD where the update aggregates all data points.
Mini-batch training is a combination of batch and stochastic training. Instead of using all training data items to compute gradients (as in batch training) or using a single training item to compute gradients (as in stochastic training), mini-batch training uses a user-specified number of training items.
So in mini-batch gradient descent we choose a set of random samples (batch) from the training set and for every sample in the batch, we calculate the gradient and then average all the gradients.
We then use the calculated average of the gradients in a batch to update the weight.
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.
This way, it reduces the variance of the parameter updates, which can lead to more stable convergence, and it can exploit highly optimized matrix operations to compute the gradient over a mini-batch very efficiently.
However we have to set mini-batch size manually.
Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used.
Note that learning rate should still be chosen carefully.
In pseudo-code, mini-batch training is:
In [0]:
loop maxEpochs times
    loop until all data items are used
        # Begin Inner Loop
        for-each item in the batch
            compute a gradient for each weight and bias
            accumulate gradient
        end-batch
        # End Inner Loop
        use accumulated gradients to update weight and bias
    end-loop (all items used)
end-loop
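A minimal Python sketch in the same style as the batch and stochastic versions above (grad_w, grad_b, X and Y are assumed to be defined as in the earlier cells; the batch size of 32 is an arbitrary choice, and leftover samples at the end of an epoch are ignored for simplicity):
In [0]:
def minibatch_gradient_descent():
    w, b, lr, max_epochs, batch_size = -2, -2, 1.0, 1000, 32
    for i in range(max_epochs):
        dw, db, count = 0, 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)   # grad_b assumed analogous to grad_w
            count += 1
            if count % batch_size == 0:
                # update once per mini-batch using the averaged gradients
                w = w - (dw / batch_size) * lr
                b = b - (db / batch_size) * lr
                dw, db = 0, 0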
Thus there are three variants of gradient descent:
The difference between these algorithms is the amount of data used per update.
Depending on the amount of data, they trade off the accuracy of the parameter update against the time it takes to perform an update.
In the following, we will outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges.
The traditional standard/batch gradient descent calculates the gradient using the whole data set but performs only one update per pass, hence it can be very slow and hard to use for datasets that are very large and don't fit in memory.
Stochastic Gradient Descent (SGD), on the other hand, computes the gradient using a single sample. It is usually a much faster technique and performs one update at a time. In other words, SGD tries to find minima or maxima by iteration. The problem with SGD is that, due to the frequent updates and fluctuations, it ultimately complicates convergence to the exact minimum and keeps overshooting.
The high variance oscillations in SGD makes it hard to reach convergence, so a technique called Momentum was invented which accelerates SGD by navigating along the relevant direction and softens the oscillations in irrelevant directions.
The concept of momentum is that if you are continuously moving in the same direction for some time, then you should probably gain confidence and start taking bigger steps in that direction.
Momentum-based techniques help with the following problems:
We introduce here a new term called velocity, which considers the previous update and a constant which is called momentum.
velocity = momentum * previous_update
weight = weight - (velocity + gradient * lr)
The velocity term considers the history of previous updates in addition to the current step.
Saddle points
A saddle point is basically a flat point in the cost surface and, because the surface is flat, the gradients are zero. When we perform the update for the weight parameter, nothing happens and we tend to get stuck at the saddle point.
When the parameter update is performed with a momentum term, we get a bigger jump in parameter values and get past the saddle point.
Local minima
Notice we have both a local minimum and a global minimum in the figure below. In general the gradient can get stuck at the local minimum.
When a momentum term is used to do the parameter update and the chosen momentum is too small, the gradient term dominates the velocity term and we can still get stuck in a local minimum.
A good value of momentum will reach the global minimum.
A large value of momentum will basically miss the global minimum.
Gradient noise
We have the true cost function in blue and a noisy approximation in red. With a momentum term the approximation may take longer to converge due to all the noise, but it doesn't get stuck in the local minimum. Notice that the velocity term averages out the noise (smoothing the surface).
High condition number
The condition number gives us a measure of how poorly gradient descent will perform. A problem with a low condition number is said to be well-conditioned, while a problem with a high condition number is said to be ill-conditioned. The condition number is a property of the problem.
If the cost function has a condition number of one it is basically round, and gradient descent reaches the minimum quickly; however, if the cost function has a high condition number, gradient descent takes a long time to converge to the minimum.
With momentum term, cost functions with high condition numbers can converge.
Momentum:
The momentum term reduces updates for dimensions whose gradients change directions.
The momentum term increases for dimensions whose gradients point in the same directions.
The momentum term $\gamma$ is usually set to 0.9 or a similar value. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way.
Momentum helps accelerate SGD in the relevant direction and dampens any oscillations.
Learning rate and momentum must be chosen manually.
In [0]:
def momentum_gradient_descent():
    prev_v_w, prev_v_b, momentum = 0, 0, 0.9
    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)   # grad_b assumed analogous to grad_w
        w = w - (momentum * prev_v_w + dw * lr)
        b = b - (momentum * prev_v_b + db * lr)
        prev_v_w = momentum * prev_v_w + dw * lr
        prev_v_b = momentum * prev_v_b + db * lr
In [0]:
def momentum_stochastic_gradient_descent():
    prev_v_w, prev_v_b, momentum = 0, 0, 0.9
    for i in range(max_epochs):
        for x, y in zip(X, Y):
            # single-sample gradients, immediate update (stochastic version)
            dw = grad_w(w, b, x, y)
            db = grad_b(w, b, x, y)   # grad_b assumed analogous to grad_w
            w = w - (momentum * prev_v_w + dw * lr)
            b = b - (momentum * prev_v_b + db * lr)
            prev_v_w = momentum * prev_v_w + dw * lr
            prev_v_b = momentum * prev_v_b + db * lr
In classical momentum we first correct our velocity and then make a big step according to that velocity (and then repeat).
With Nesterov momentum, however, we first make a step in the direction of the velocity and then correct the velocity vector based on the new location (then repeat).
In practice, this makes a big difference.
In Nesterov momentum we look ahead at the gradient and therefore make smaller corrections on the way to the minimum, unlike plain momentum where a larger overshoot can leave us farther from the minimum and take longer to get back.
Hence the oscillations are smaller and the chance of escaping the minimum's valley is also smaller.
In [0]:
def nesterov_momentum_gradient_descent():
    prev_v_w, prev_v_b, momentum = 0, 0, 0.9
    for i in range(max_epochs):
        dw, db = 0, 0
        # Look-ahead step: move in the direction of the current velocity first
        v_w = momentum * prev_v_w
        v_b = momentum * prev_v_b
        for x, y in zip(X, Y):
            # Gradients are evaluated at the look-ahead position (w - v_w, b - v_b)
            dw += grad_w(w - v_w, b - v_b, x, y)
            db += grad_b(w - v_w, b - v_b, x, y)   # grad_b assumed analogous to grad_w
        w = w - (momentum * prev_v_w + dw * lr)
        b = b - (momentum * prev_v_b + db * lr)
        prev_v_w = momentum * prev_v_w + dw * lr
        prev_v_b = momentum * prev_v_b + db * lr
In [0]:
def nesterov_momentum_stochastic_gradient_descent():
    prev_v_w, prev_v_b, momentum = 0, 0, 0.9
    for i in range(max_epochs):
        for x, y in zip(X, Y):
            # Look-ahead with the current velocity, then a single-sample gradient and an immediate update
            v_w = momentum * prev_v_w
            v_b = momentum * prev_v_b
            dw = grad_w(w - v_w, b - v_b, x, y)
            db = grad_b(w - v_w, b - v_b, x, y)   # grad_b assumed analogous to grad_w
            w = w - (v_w + dw * lr)
            b = b - (v_b + db * lr)
            prev_v_w = v_w + dw * lr
            prev_v_b = v_b + db * lr
Mini batch versions of Momentum and Nesterov can also be demonstrated.
There have been several attempts to use heuristics for estimating a good learning rate at each iteration of gradient descent. These either attempt to speed up learning when suitable or to slow down learning near a local minimum. Here we consider the latter.
When gradient descent nears a minimum in the cost surface, the parameter values can oscillate back and forth around it. One method to prevent this is to slow down the parameter updates by decreasing the learning rate.
This can be done manually when the validation accuracy appears to plateau.
Alternatively, learning rate schedules have been proposed to automatically anneal the learning rate based on how many epochs through the data have been done. These approaches typically add additional hyperparameters to control how quickly the learning rate decays.
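As a hedged sketch of how such a schedule looks in PyTorch, torch.optim.lr_scheduler.StepLR multiplies the learning rate by gamma every step_size epochs (the model and the elided training step below are placeholders):
In [0]:
import torch
model = torch.nn.Linear(10, 2)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    # ... forward pass, loss.backward() and optimizer.step() would go here ...
    scheduler.step()             # anneal the learning rate after each epoch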
Setting initial learning rate:
Annealing learning rate:
Momentum:
The following schedule was suggested by Sutskever et al. (2013): $$\gamma_t = \min\left(1 - 2^{-1-\log_2(\lfloor t/250\rfloor+1)},\ \gamma_{max}\right)$$
where $\gamma_{max}$ was chosen from {0.999, 0.995, 0.99, 0.9, 0} and
$\epsilon$ = {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}
Momentum and Nesterov methods work better on difficult functions with complex level sets; however, they still require the learning rate to be chosen manually and they are very sensitive to that choice.
Adagrad is an algorithm that adapts the learning rate to the parameters. That is, it chooses the learning rate adaptively so that we don't have to choose it ourselves.
It performs larger updates for rarely changed parameters and smaller updates for parameters that change frequently. That is achieved by taking into account all previous gradients of each parameter. However, that causes the learning rate to decrease with every training step, and it eventually becomes too small for the network to learn anything. In this algorithm, the learning rate is adjusted on the basis of how the gradient has been changing over all the previous iterations.
grad_component = previous_grad_component + (gradient * gradient)
rate_change = square_root(grad_component) + epsilon
adapted_learning_rate = learning_rate / rate_change
update = adapted_learning_rate * gradient
weight = weight - update
In the above pseudo-code, epsilon is a small constant that avoids division by zero while the accumulated gradient is still zero.
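A minimal NumPy sketch of the same update for a single weight (the quadratic toy objective and the hyperparameter values are assumptions for illustration only):
In [0]:
import numpy as np
def adagrad(grad, w=0.0, lr=1.0, eps=1e-8, steps=200):
    grad_component = 0.0
    for _ in range(steps):
        g = grad(w)
        grad_component += g * g                           # accumulate squared gradients
        w = w - lr / (np.sqrt(grad_component) + eps) * g  # per-step adapted learning rate
    return w
# toy objective (w - 3)^2 with gradient 2*(w - 3); the result moves towards 3
print(adagrad(lambda w: 2 * (w - 3)))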
Adadelta
Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.
Instead of inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients.
RMSprop
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton.
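A minimal sketch of the RMSprop update for a single weight, showing the decaying average of squared gradients (the hyperparameters are common defaults, not values prescribed here):
In [0]:
import numpy as np
def rmsprop(grad, w=0.0, lr=0.01, decay=0.9, eps=1e-8, steps=500):
    avg_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        avg_sq = decay * avg_sq + (1 - decay) * g * g   # decaying average, not a full sum
        w = w - lr / (np.sqrt(avg_sq) + eps) * g
    return w
print(rmsprop(lambda w: 2 * (w - 3)))   # moves towards the minimum at w = 3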
Adam
ADAM is one more adaptive technique that builds on Adagrad and further reduces its downsides. In other words, you can think of it as momentum + Adagrad.
In fact, Adam is a combination of momentum and Adadelta: in addition to keeping a decaying average of past squared gradients like Adadelta, it also keeps a decaying average of past gradients, similar to momentum. The authors show that Adam works well in practice and that it is comparable to other adaptive learning algorithms.
first_moment = beta1 * previous_first_moment + (1 - beta1) * gradient
second_moment = beta2 * previous_second_moment + (1 - beta2) * (gradient * gradient)
update = learning_rate * first_moment / (square_root(second_moment) + epsilon)
weight = weight - update
Here beta1 and beta2 are the decay rates for the first and second moment estimates (typically 0.9 and 0.999); the full algorithm also applies a bias correction to both moments before the update.
There are also second-order methods such as L-BFGS.
In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
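A hedged NumPy sketch of the Adam update for a single weight, including the bias-correction terms from the original paper (the toy objective and hyperparameters are illustrative assumptions):
In [0]:
import numpy as np
def adam(grad, w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # decaying average of past gradients (momentum)
        v = beta2 * v + (1 - beta2) * g * g     # decaying average of past squared gradients
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
print(adam(lambda w: 2 * (w - 3)))   # moves towards the minimum at w = 3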
AdaMax
Adamax is supposed to be used when you're using some setup that has sparse parameter updates (i.e. word embeddings).
Pytorch
PyTorch provides the torch.optim package, which defines a number of common optimization algorithms, such as:
torch.optim.SGD: stochastic gradient descent,
torch.optim.Adam: adaptive moment estimation,
torch.optim.RMSprop: an algorithm developed by Geoffrey Hinton in his Coursera course,
torch.optim.LBFGS: limited-memory Broyden–Fletcher–Goldfarb–Shanno,
Each of these optimizers is constructed with a list of parameter objects, usually retrieved via the parameters() method of an nn.Module subclass, that determine which values are updated by the optimizer. Besides this parameter list, the optimizers each take a certain number of additional arguments to configure their optimization strategy.
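A minimal usage sketch (the model, dummy data and loss here are placeholders chosen only to show the optimizer API):
In [0]:
import torch
import torch.nn as nn
model = nn.Linear(4, 2)                                            # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
x = torch.randn(8, 4)                                              # dummy batch
y = torch.randint(0, 2, (8,))
for epoch in range(5):
    optimizer.zero_grad()               # clear the gradients accumulated so far
    loss = criterion(model(x), y)
    loss.backward()                     # compute gradients for all parameters
    optimizer.step()                    # let the optimizer update the parameters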
Other gradient descent algorithms include:
Which ones to use:
For rapid prototyping, use adaptive techniques like Adam/Adagrad. These help in getting quicker results with much less efforts. As here, you don’t require much hyper-parameter tuning.
To get the best results, you should use vanilla gradient descent or momentum. Gradient descent is slower to reach the desired results, but those results are mostly better than what the adaptive techniques give.
If your data is small and can be fit in a single iteration, you can use 2nd order techniques like l-BFGS. This is because 2nd order techniques are extremely fast and accurate, but are only feasible when data is small enough.
There is also an emerging line of work that uses learned features to predict the learning rates of gradient descent.
Data Loading
For convenience, PyTorch provides a number of utilities to load, preprocess and interact with datasets. These helper classes and functions are found in the torch.utils.data module. The two major concepts here are:
A Dataset, which encapsulates a source of data,
A DataLoader, which is responsible for loading a dataset, possibly in parallel.
New datasets are created by subclassing the torch.utils.data.Dataset class and overriding the __len__ method, to return the number of samples in the dataset, and the __getitem__ method, to access a single value at a certain index.
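A minimal sketch of such a dataset (the tensors are dummy data; only the two overridden methods matter):
In [0]:
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.x = torch.randn(n, 4)             # dummy features
        self.y = torch.randint(0, 2, (n,))     # dummy labels
    def __len__(self):
        return len(self.x)                     # number of samples
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]        # one (input, target) pair

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
for xb, yb in loader:
    print(xb.shape, yb.shape)                  # torch.Size([16, 4]) torch.Size([16])
    break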
Iterations per epoch = training set size / batch size
With this we can calculate the number of iterations needed to loop through the entire dataset once.
Batch size can affect the performance (convergence) of a model.
Clear the gradients before calling .backward() a second time; the call that zeroes the gradients can be placed wherever it suits in the training loop.
Use validation data (not training data) to compare two models, and check for signs of overfitting by comparing their costs; plot the two cost curves for comparison.
Training data is used to train the model; validation data is used to choose hyperparameters.
We need a prediction that is either class 0 or class 1. For that we can use a threshold function that returns 1 if z > 0 and 0 if z < 0.
The sigmoid (logistic) function looks similar to the threshold function: for a very large negative z it outputs a value close to 0, for a very large positive z it outputs a value close to 1, and everything in the middle is around 0.5.
The input size depends on the dimensionality of your features.
Each 28x28 image is flattened into a 1x784 tensor (column-wise features).
The input dimension is 28x28 = 784 features, since we associate a weight with each pixel; the output dimension is 10 (one element per class).
print("w:", list(model.parameters())[0].size())
print("b:", list(model.parameters())[1].size())
PlotParameters(model)
The target for the cross-entropy loss has to have shape (n,), not (n, 1).
we can use validation data to determine the optimum number of hidden neurons, and we can also use regularization.
In [0]:
import matplotlib.pyplot as plt

def PlotParameters(model):
    W = model.state_dict()['linear.weight'].data
    w_min = W.min().item()
    w_max = W.max().item()
    fig, axes = plt.subplots(2, 5)
    fig.subplots_adjust(hspace=0.01, wspace=0.1)
    for i, ax in enumerate(axes.flat):
        if i < 10:
            # Set the label for the sub-plot.
            ax.set_xlabel("class: {0}".format(i))
            # Plot the image.
            ax.imshow(W[i, :].view(28, 28), vmin=w_min, vmax=w_max, cmap='seismic')
            ax.set_xticks([])
            ax.set_yticks([])
    # Ensure the plot is shown correctly with multiple plots
    # in a single Notebook cell.
    plt.show()
The loss function criterion = nn.CrossEntropyLoss() applies the softmax internally, so the model can output raw logits. You can apply nn.Softmax yourself to convert the logits to probabilities.
input_dim=28*28
output_dim=10
model = nn.Sequential(nn.Linear(input_dim,output_dim), nn.Softmax(dim=1))
This gives a final accuracy in the accuracy_list of 0.9092 which is the same as the Softmax(input_dim, output_dim) custom module.
Good point: nn.Softmax() just creates an object that normalizes the output. You can apply it in several different ways; here is another way:
sm = nn.Softmax(dim=1)
z=sm(model(data_set.x))
_, yhat = z.max(1)
yhat
In the PyTorch documentation, they usually leave the last layer as a linear layer. Thanks for the feedback.
we use our BCE loss because our output is a logistic unit.
In [0]:
# STOCHASTIC GRADIENT DESCENT
for epoch in range(4):
    for x, y in zip(X, Y):   # X, Y are the training inputs and targets
        yhat = forward(x)
        loss = criteria(yhat, y)
        loss.backward()
        w.data = w.data - lr * w.grad.data
        b.data = b.data - lr * b.grad.data
        w.grad.data.zero_()   # zero the gradients in-place after each update
        b.grad.data.zero_()