The activation function is one of the most important architectural choices in a neural network. Without an activation function, a neural network can essentially act only as a linear regression model. Since the relationships in most datasets are non-linear to at least some degree, an activation function makes it possible to model more complex relationships between variables, allowing a neural network to be a true universal function approximator.
For each hidden layer node, the values of its input nodes are multiplied by their current weights and then summed. As an example, a hidden layer node with three inputs might be modeled by:
n = (w1x1) + (w2x2) + (w3x3)
Where x1, x2, and x3 are the input variables, and w1, w2, and w3 are the multiplicative weights to be optimized during training. Notice (as mentioned above) that this expression is entirely linear. If we were to stop here, the best the neural network could do is reproduce linear regression. Furthermore, each node in each additional layer would still be just a weighted linear combination of the previous layer's outputs, so without an activation function the entire neural network could be represented by a single linear function!
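To make that collapse concrete, here is a minimal sketch (the weights and inputs are made up for illustration) showing that two stacked weight-only layers reduce to a single matrix of weights applied to the inputs:
In [ ]:
import numpy as np

# Hypothetical weights for two stacked layers with no activation function
W1 = np.array([[0.2, -0.5, 0.1],
               [0.4,  0.3, -0.2]])   # 2 hidden nodes, 3 inputs
W2 = np.array([[0.7, -0.6]])         # 1 output node, 2 hidden nodes

x = np.array([1.0, 2.0, 3.0])        # example input vector

two_layers = W2 @ (W1 @ x)           # pass through both layers
one_layer = (W2 @ W1) @ x            # a single, equivalent linear layer

print(two_layers, one_layer)         # identical outputs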
The activation function takes this linear input and translates it into a nonlinear output. For example, a logistic activation function is given by:
f(n) = 1 / (1 + exp(-n))
Where "n" is the value of the hidden layer node, the same as the first equation above. This transfer function allows the output to model non-linear system behavior, which most real-world problems will exhibit.
The logistic activation function, mentioned above, looks like:
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-5, 5, 0.01)
y = 1 / (1 + np.exp(-x))
plt.plot(x,y)
plt.title('Logistic Activation Function')
plt.xlabel('Input')
plt.ylabel('Output');
Logistic activation functions are often used for classification problems because of the asymptotic behavior at the positive and negative ends of the function, with a gradual transition in between. In the plot above, any input below about -4 produces an output of roughly 0, while any input above about 4 produces an output very close to 1.
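A quick numerical check of those tail values, using the same formula as the plot above:
In [ ]:
# Logistic output at a few representative inputs
for n in (-4, 0, 4):
    print(n, 1 / (1 + np.exp(-n)))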
The hyperbolic tangent activation function (also known as "tanh") looks like:
In [2]:
x = np.arange(-5, 5, 0.01)
y = (2 / (1 + np.exp(-2*x))) - 1   # equivalent to np.tanh(x)
plt.plot(x,y)
plt.title('Tanh Activation Function')
plt.xlabel('Input')
plt.ylabel('Output');
Notice that the shape of the tanh function is very similar to the logistic function, but its output ranges from -1 to 1 and the transition through the "middle" of the curve is noticeably steeper. Conceptually, the tanh function might work better for classification problems in which the classes are more sharply separated.
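One way to quantify that steeper "middle" is to compare the slopes of the two curves at the origin; a rough central-difference estimate (the exact values are 0.25 for the logistic and 1 for tanh) is:
In [ ]:
h = 1e-6
logistic = lambda x: 1 / (1 + np.exp(-x))

# Numerical slope at x = 0 for each activation function
print((logistic(h) - logistic(-h)) / (2 * h))   # ~0.25
print((np.tanh(h) - np.tanh(-h)) / (2 * h))     # ~1.0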
The arctan activation function looks very similar to the hyperbolic tangent:
In [3]:
x = np.arange(-5, 5, 0.01)
y = np.arctan(x)
plt.plot(x,y)
plt.title('Arctan Activation Function')
plt.xlabel('Input')
plt.ylabel('Output');
The arctan, again, has an S-shape similar to the previous two activation functions, but it saturates more slowly, approaching its asymptotes of ±π/2 only gradually as the input grows rather than flattening out almost immediately.
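To see that slower saturation, the sketch below rescales arctan to the same (-1, 1) output range as tanh and compares the two at a few inputs:
In [ ]:
# How close each curve is to its asymptote; arctan rescaled to (-1, 1) for comparison
for x in (2, 5, 10):
    print(x, np.tanh(x), (2 / np.pi) * np.arctan(x))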
All three of these activation functions are good choices to consider for classification problems.
The rectified linear unit, also known as ReLU, looks like:
In [4]:
x = np.arange(-5, 5, 0.01)
z = np.zeros(len(x))
y = np.maximum(z, x)   # ReLU: element-wise max(0, x)
plt.plot(x,y)
plt.title('ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output');
Notice that the ReLU is not differentiable at 0, which can pose problems for gradient descent and similar optimization algorithms (in practice, the gradient at that single point is simply set to 0 or 1). Because every input less than or equal to 0 maps to an output of exactly zero, the ReLU is effective at "turning off" many nodes, producing sparse activations.
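A small sketch of that "turning off" effect, applying the ReLU to randomly generated pre-activation values (roughly half end up at exactly zero because the random inputs are centered on zero):
In [ ]:
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=1000)     # random node inputs centered on zero
outputs = np.maximum(0, pre_activations)    # ReLU

# Fraction of nodes whose output is exactly zero ("turned off")
print(np.mean(outputs == 0))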
Finally, the softplus function looks like:
In [5]:
x = np.arange(-5, 5, 0.01)
y = np.log(1+np.exp(x))
plt.plot(x,y)
plt.title('Softplus Activation Function')
plt.xlabel('Input')
plt.ylabel('Output');
The behavior of the softplus activation function is similar to that of the ReLU, but it is differentiable everywhere, which makes it attractive for some optimization algorithms.
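In fact, the derivative of the softplus is exactly the logistic function from earlier, which is easy to check with a central-difference estimate:
In [ ]:
h = 1e-6
softplus = lambda x: np.log(1 + np.exp(x))
logistic = lambda x: 1 / (1 + np.exp(-x))

# The slope of softplus at each point matches the logistic output
for x in (-2.0, 0.0, 2.0):
    print(x, (softplus(x + h) - softplus(x - h)) / (2 * h), logistic(x))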