Title: Defining Activation Functions Slug: activation-functions Summary: An Overview of Implementing Activation Functions in Your Own Neural Network Date: 2018-01-01 09:11 Category: Neural Networks Tags: Basics Authors: Thomas Pinder

Activation functions are an integral part of a neural network, mapping each neuron's weighted input to an output value. It is through the use of an activation function that a neural network can model non-linear mappings, and consequently the choice of activation function is important. In this brief summary, the sigmoid, tanh, ReLU and softmax activation functions are presented along with an implementation of each.

Preliminaries

With all activation functions, not only is the function itself needed, but its derivative is also required for backpropagation. Some people prefer to define these as separate functions; I prefer to wrap both in a single function for conciseness, switching on a derivative flag. For all of the following activation functions, the NumPy library should be loaded.


In [3]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Sigmoid

The sigmoid function was once the default choice of activation function when building a network, and to some extent it still is. Because it maps values into the range between 0 and 1, it lacks the beneficial quality of being zero-centered, a property that aids gradient descent during backpropagation.


In [4]:
def activation_sigmoid(x, derivative=False):
    # Sigmoid: 1 / (1 + exp(-x)), squashing x into the range (0, 1).
    sigmoid_value = 1/(1+np.exp(-x))
    if not derivative:
        return sigmoid_value
    else:
        # Derivative: sigmoid(x) * (1 - sigmoid(x)).
        return sigmoid_value*(1-sigmoid_value)

When plotted over the range -5 to 5, this gives the following shape.


In [5]:
x_values = np.arange(-5, 6, 0.1)
y_sigmoid = activation_sigmoid(x_values, derivative=False)

plt.plot(x_values, y_sigmoid)


[Figure: plot of the sigmoid activation]
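
The derivative branch can be checked in the same way. This cell is not part of the original notebook; it simply reuses activation_sigmoid as defined above and plots its gradient, which peaks at 0.25 when x = 0.


In [6]:
y_sigmoid_derivative = activation_sigmoid(x_values, derivative=True)
plt.plot(x_values, y_sigmoid_derivative)  # bell-shaped curve, maximum of 0.25 at x = 0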

Tanh

tanh is very similar in shape to the sigmoid; the defining difference is that tanh ranges from -1 to 1, making it zero-centered and consequently a very popular choice. Conveniently, tanh is pre-defined in NumPy, but it is still worthwhile wrapping it in a function so that its derivative can be defined alongside it.


In [7]:
def activation_tanh(x, derivative=False):
    # tanh squashes x into the range (-1, 1) and is zero-centered.
    tanh_value = np.tanh(x)
    if not derivative:
        return tanh_value
    else:
        # Derivative: 1 - tanh(x)^2.
        return 1-tanh_value**2

y_tanh = activation_tanh(x_values, derivative=False)
plt.plot(x_values, y_tanh)


[Figure: plot of the tanh activation]

ReLU

The Rectified Linear Unit (ReLU) is another commonly used activation function, with a range from 0 to infinity. A major advantage of the ReLU is that, unlike the sigmoid and tanh, its gradient does not saturate for large positive inputs. An additional benefit of the ReLU is its computational efficiency, as shown by Krizhevsky et al., who found that a network using ReLUs trained around six times faster than an equivalent network using tanh.


In [9]:
def relu_activation(x, derivative=False):
    # ReLU: returns x where x > 0 and 0 elsewhere.
    if not derivative:
        return x * (x > 0)
    else:
        # Derivative: 1 where x > 0, 0 elsewhere (without modifying x in place).
        return (x > 0).astype(float)

y_relu = relu_activation(x_values, derivative=False)
plt.plot(x_values, y_relu)


[Figure: plot of the ReLU activation]

It is worth noting that the leaky ReLU is a closely related function; the only difference is that negative values are not set to 0 but are instead multiplied by a small constant, typically 0.01.
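
A minimal sketch of the leaky ReLU, following the same pattern as the functions above; the function name and the 0.01 slope default are illustrative choices rather than anything fixed by this post.


In [10]:
def leaky_relu_activation(x, derivative=False, slope=0.01):
    # Leaky ReLU: x where x > 0, slope * x elsewhere.
    if not derivative:
        return np.where(x > 0, x, slope * x)
    else:
        # Derivative: 1 where x > 0, slope elsewhere.
        return np.where(x > 0, 1.0, slope)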

Softmax

The final function to be discussed is the softmax, a function typically used in the final layer of a network. The softmax reduces the value of each neurone in the final layer to a value between 0 and 1, such that all the values in the final layer sum to 1. The benefit of this is that in a multi-class classification problem the softmax assigns a probability to each class, allowing deeper insight into the performance of the network through metrics such as top-n error. Note that the softmax is sometimes written without the subtraction of np.max(x); subtracting the maximum stabilises the function numerically, because the exponent in the softmax can otherwise produce a value larger than the largest number a 64-bit float can hold (roughly 1.8 × 10^308), causing an overflow.


In [11]:
def softmax_activation(x):
    # Subtracting the maximum keeps the exponent from overflowing; the result is unchanged.
    exponent = np.exp(x - np.max(x))
    softmax_value = exponent/np.sum(exponent, axis=0)
    return softmax_value

y_softmax = softmax_activation(x_values)
plt.plot(x_values, y_softmax)
print("The sum of all softmax probabilities can be confirmed as " + str(np.sum(y_softmax)))


The sum of all softmax probabilities can be confirmed as 1.0
[Figure: plot of the softmax output]
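
To see why the subtraction of np.max(x) matters, consider a hypothetical variant without it; this cell is purely illustrative and was not part of the original notebook. For large inputs the bare exponential overflows and produces nan values, whereas the stabilised version above does not.


In [12]:
def softmax_unstable(x):
    # The same softmax but without subtracting the maximum.
    exponent = np.exp(x)
    return exponent/np.sum(exponent, axis=0)

large_inputs = np.array([10.0, 800.0, 900.0])
print(softmax_unstable(large_inputs))    # overflow warning: [0., nan, nan]
print(softmax_activation(large_inputs))  # stable: approximately [0., 0., 1.]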

Conclusion

This brief discussion of the main activation functions used in neural networks should provide you with a good understanding of how each function works and how they relate to one another. If in doubt, it is generally advisable to build your network using the ReLU function in the hidden layers and the softmax function in the final layer; however, it is often worth trialling different functions to be sure.
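
As a rough illustration of that default recipe, a single forward pass with a ReLU hidden layer and a softmax output layer could look something like the cell below; the layer sizes and random weights are arbitrary choices for the example, not values suggested anywhere above.


In [13]:
# Toy forward pass: ReLU in the hidden layer, softmax on the output layer.
rng = np.random.RandomState(0)
inputs = rng.rand(4)              # 4 input features
weights_hidden = rng.rand(8, 4)   # 8 hidden neurones
weights_output = rng.rand(3, 8)   # 3 output classes

hidden = relu_activation(weights_hidden @ inputs, derivative=False)
class_probabilities = softmax_activation(weights_output @ hidden)
print(class_probabilities, np.sum(class_probabilities))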