This looks pretty neat. They prove that with a slightly modified ELU activation, unit activations converge towards zero mean / unit variance (provided the network is deep enough). If they're right, this might make batch norm obsolete, which would be a huge boon for training speed!
The experiments look convincing, and apparently it even beats BN+ReLU in accuracy... though
I wish they had also shown the resulting distributions of activations after training.
But assuming their fixed-point proof holds, those should end up close to zero mean / unit variance anyway.
Still, it would've been nice to see -- maybe they ran out of space in their appendix ;)
Weirdly, the exact ELU modification they propose isn't stated explicitly in the paper!
For those wondering, it can be found in the available source code, and looks like this:
In [1]:
# An extra explanation from Reddit
# # Thanks, I will double check the analytical solution. For the numerical one, could you please explain why running the following code results in a value close to 1 rather than 0?
# du = 0.001
# u_old = np.mean(selu(np.random.normal(0, 1, 100000000)))
# u_new = np.mean(selu(np.random.normal(0+du, 1, 100000000)))
# # print((u_new - u_old) / du)
# print(u_old, u_new)
# # Now I see your problem:
# # You do not consider the effect of the weights.
# # From one layer to the next, we have two influences:
# # (1) multiplication with the weights and
# # (2) applying the SELU.
# # (1) has a centering and symmetrizing effect (draws the mean towards zero) and
# # (2) has a variance-stabilizing effect (draws the variance towards 1).
# # That is why we use the variable pairs \mu&\omega and \nu&\tau to analyze both effects.
# # Oh yes, that's true, zero-mean weights completely kill the mean. Thanks!
# The SELU activation itself (NumPy version; constants taken from the authors' TensorFlow code)
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * (np.exp(x) - 1))
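To make the Reddit explanation above concrete, here is a small sketch I added myself (not from the paper or the original post; the seed and sizes are arbitrary): start an input far away from the fixed point and let zero-mean random weights plus SELU pull the statistics back towards mean 0 / variance 1.
In [ ]:
# Sketch (my addition): zero-mean weights center the activations, SELU stabilizes their variance.
np.random.seed(0)
x = np.random.normal(loc=2.0, scale=3.0, size=(1000, 200))  # deliberately far from mean 0 / variance 1
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # zero-mean weights: centering effect
    x = selu(x @ w)                                              # SELU: variance-stabilizing effect
print(x.mean(), x.std())  # both should come out close to 0 and 1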
In [2]:
# TensorFlow implementation from the authors' GitHub repo
# def dropout_selu(x, rate, alpha= -1.7580993408473766, fixedPointMean=0.0, fixedPointVar=1.0,
#                  noise_shape=None, seed=None, name=None, training=False):
#     """Dropout to a value with rescaling."""
#
#     def dropout_selu_impl(x, rate, alpha, noise_shape, seed, name):
#         keep_prob = 1.0 - rate
#         x = ops.convert_to_tensor(x, name="x")
#         if isinstance(keep_prob, numbers.Real) and not 0 < keep_prob <= 1:
#             raise ValueError("keep_prob must be a scalar tensor or a float in the "
#                              "range (0, 1], got %g" % keep_prob)
#         keep_prob = ops.convert_to_tensor(keep_prob, dtype=x.dtype, name="keep_prob")
#         keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())
#
#         alpha = ops.convert_to_tensor(alpha, dtype=x.dtype, name="alpha")
#         keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())
#
#         if tensor_util.constant_value(keep_prob) == 1:
#             return x
#
#         noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
#         random_tensor = keep_prob
#         random_tensor += random_ops.random_uniform(noise_shape, seed=seed, dtype=x.dtype)
#         binary_tensor = math_ops.floor(random_tensor)
#         ret = x * binary_tensor + alpha * (1-binary_tensor)
#
#         a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))
#         b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
#         ret = a * ret + b
#         ret.set_shape(x.get_shape())
#         return ret
#
#     with ops.name_scope(name, "dropout", [x]) as name:
#         return utils.smart_cond(training,
#             lambda: dropout_selu_impl(x, rate, alpha, noise_shape, seed, name),
#             lambda: array_ops.identity(x))
In [3]:
# """"""""""""""""""""""""""""""""""""""Dropout to a value with rescaling."""""""""""""""""""""""""""""""""""""""
# NumPy implementation
def dropout_selu(X, p_dropout):
alpha= -1.7580993408473766
fixedPointMean=0.0
fixedPointVar=1.0
keep_prob = 1.0 - p_dropout
# noise_shape = noise_shape.reshape(X)
random_tensor = keep_prob
# random_tensor += random_ops.random_uniform(noise_shape, seed=seed, dtype=x.dtype)
random_tensor += np.random.uniform(size=X.shape) # low=0, high=1
# binary_tensor = math_ops.floor(random_tensor)
binary_tensor = np.floor(random_tensor)
ret = X * binary_tensor + alpha * (1-binary_tensor)
# a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))
a = np.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * ((alpha-fixedPointMean)**2) + fixedPointVar)))
b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
ret = a * ret + b
# ret.set_shape(x.get_shape())
ret = ret.reshape(X.shape)
return ret
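A quick sanity check I added (not part of the original post; seed and sizes are arbitrary): alpha dropout is constructed so that an input which already has the fixed-point statistics keeps them in expectation, and a Monte Carlo estimate should confirm this.
In [ ]:
# Sanity check (my addition): for zero-mean / unit-variance input, the affine correction (a, b)
# should bring the dropped-out activations back to mean ~0 and std ~1.
np.random.seed(1)
z = np.random.normal(0.0, 1.0, size=(1000, 1000))
out = dropout_selu(z, p_dropout=0.10)
print(out.mean(), out.std())  # expect values close to 0 and 1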
In [5]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x = dropout_selu(X=x, p_dropout=0.10)
mean = x.mean(axis=1)
scale = x.std(axis=1)  # standard deviation = sqrt(variance)
print(mean.min(), mean.max(), scale.min(), scale.max())
In [6]:
# My NumPy implementation of standard (inverted) dropout for ReLU
def dropout_forward(X, p_keep):
    # mask is 1 with probability p_keep, rescaled by 1/p_keep so the expected activation is unchanged
    u = np.random.binomial(1, p_keep, size=X.shape) / p_keep
    out = X * u
    cache = u
    return out, cache

def dropout_backward(dout, cache):
    dX = dout * cache
    return dX
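One more check I added (my own snippet, arbitrary seed and sizes): because the kept units are divided by the keep probability ("inverted" dropout), the expected activation is unchanged, so no rescaling is needed at test time; only the variance changes.
In [ ]:
# Sanity check (my addition): inverted dropout leaves the mean activation roughly unchanged.
np.random.seed(2)
z = np.maximum(0.0, np.random.normal(size=(1000, 1000)))  # e.g. post-ReLU activations
out, _ = dropout_forward(z, p_keep=0.8)
print(z.mean(), out.mean())  # the two means should be close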
In [7]:
# EDIT: Same experiment, but with SELU + plain inverted dropout instead of alpha dropout:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x, _ = dropout_forward(X=x, p_keep=0.8)
mean = x.mean(axis=1)
scale = x.std(axis=1)  # standard deviation = sqrt(variance)
print(mean.min(), mean.max(), scale.min(), scale.max())
In [6]:
def elu_fwd(X):
    alpha = 1.0
    scale = 1.0
    X_pos = np.maximum(0.0, X)               # linear part for X > 0
    X_neg = np.minimum(X, 0.0)               # exponential part for X <= 0
    X_neg_exp = alpha * (np.exp(X_neg) - 1)
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X)
    return out, cache

def elu_bwd(dout, cache):
    scale, alpha, X = cache
    dout = dout * scale
    # negative branch: d/dx [alpha * (exp(x) - 1)] = alpha * exp(x)
    dX_neg = dout.copy()
    dX_neg[X >= 0] = 0
    dX_neg = dX_neg * alpha * np.exp(np.minimum(X, 0))
    # positive branch: derivative is 1
    dX_pos = dout.copy()
    dX_pos[X < 0] = 0
    dX = dX_neg + dX_pos
    return dX
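Since the backward pass is hand-written, here is a quick numerical gradient check I added (standard central differences; not part of the original post, seed and shapes are arbitrary).
In [ ]:
# Gradient check (my addition): compare elu_bwd against central differences.
# Random inputs are almost surely nonzero, so the kink at x = 0 is not an issue.
np.random.seed(3)
X = np.random.normal(size=(5, 7))
dout = np.random.normal(size=X.shape)
out, cache = elu_fwd(X)
dX = elu_bwd(dout, cache)
eps = 1e-6
dX_num = (elu_fwd(X + eps)[0] - elu_fwd(X - eps)[0]) / (2 * eps) * dout
print(np.max(np.abs(dX - dX_num)))  # should be tiny (on the order of 1e-9)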
In [7]:
# EDIT: Same experiment again, this time with plain ELU + plain inverted dropout:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, _ = elu_fwd(X=x)
    x, _ = dropout_forward(X=x, p_keep=0.95)
mean = x.mean(axis=1)
scale = x.std(axis=1)  # standard deviation = sqrt(variance)
print(mean.min(), mean.max(), scale.min(), scale.max())
In [13]:
def selu_fwd(X):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    X_pos = np.maximum(0.0, X)               # linear part for X > 0
    X_neg = np.minimum(X, 0.0)               # exponential part for X <= 0
    X_neg_exp = alpha * (np.exp(X_neg) - 1)
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X)
    return out, cache

def selu_bwd(dout, cache):
    scale, alpha, X = cache
    dout = dout * scale
    # negative branch: d/dx [alpha * (exp(x) - 1)] = alpha * exp(x)
    dX_neg = dout.copy()
    dX_neg[X >= 0] = 0
    dX_neg = dX_neg * alpha * np.exp(np.minimum(X, 0))
    # positive branch: derivative is 1
    dX_pos = dout.copy()
    dX_pos[X < 0] = 0
    dX = dX_neg + dX_pos
    return dX

def dropout_selu_forward(X, keep_prob):
    alpha = -1.7580993408473766   # dropped units are set to this value (= -lambda * alpha_selu)
    fixedPointMean = 0.0
    fixedPointVar = 1.0
    # binary mask in {0, 1}; unlike inverted dropout there is no division by keep_prob,
    # the affine correction (a, b) below does the rescaling instead (as in the TF reference)
    u = np.random.binomial(1, keep_prob, size=X.shape)
    out = X * u + alpha * (1 - u)
    a = np.sqrt(fixedPointVar / (keep_prob * ((1 - keep_prob) * (alpha - fixedPointMean)**2 + fixedPointVar)))
    b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
    out = a * out + b
    cache = a, u
    return out, cache

def dropout_selu_backward(dout, cache):
    a, u = cache
    dX = dout * a * u
    return dX
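To check that the two hand-written backward passes compose correctly, here is a chained gradient check I added (not from the original post). Re-seeding before each forward call makes the dropout mask reproducible, so the central differences use the same mask as the analytic gradient; the helper fwd, the seed values, and the shapes are all my own choices.
In [ ]:
# Chained gradient check (my addition): selu_fwd -> dropout_selu_forward forward,
# dropout_selu_backward -> selu_bwd backward. Re-seeding fixes the dropout mask across calls.
def fwd(X, seed=7):
    h, selu_cache = selu_fwd(X)
    np.random.seed(seed)  # same mask on every call
    out, drop_cache = dropout_selu_forward(h, keep_prob=0.95)
    return out, (selu_cache, drop_cache)

np.random.seed(4)
X = np.random.normal(size=(5, 7))
dout = np.random.normal(size=X.shape)
out, (selu_cache, drop_cache) = fwd(X)
dX = selu_bwd(dropout_selu_backward(dout, drop_cache), selu_cache)
eps = 1e-6
dX_num = (fwd(X + eps)[0] - fwd(X - eps)[0]) / (2 * eps) * dout
print(np.max(np.abs(dX - dX_num)))  # should be tiny (on the order of 1e-9)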
In [14]:
# EDIT: Same experiment with the fwd/bwd-style SELU + alpha dropout from the cell above:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, _ = selu_fwd(x)
    x, _ = dropout_selu_forward(X=x, keep_prob=0.95)
mean = x.mean(axis=1)
scale = x.std(axis=1)  # standard deviation = sqrt(variance)
print(mean.min(), mean.max(), scale.min(), scale.max())
In [18]:
# EDIT: And once more with selu_fwd + the NumPy dropout_selu (p_dropout=0.10, i.e. keep 90%):
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, _ = selu_fwd(x)
    x = dropout_selu(X=x, p_dropout=0.10)
mean = x.mean(axis=1)
scale = x.std(axis=1)  # standard deviation = sqrt(variance)
print(mean.min(), mean.max(), scale.min(), scale.max())
According to this, even after 100 layers, neuron activations stay fairly close to mean 0 / variance 1 (even the most extreme means/variances are only off by about 0.2).
Sepp Hochreiter is amazing: LSTM, meta-learning, and now self-normalizing neural networks (SNNs).
I think he has already made a much larger contribution to science than some self-proclaimed pioneers of DL who spend more time on social networks than actually doing good research.