This looks pretty neat. They prove that with a slightly modified ELU activation, the average unit activations converge towards zero mean / unit variance (if the network is deep enough). If they're right, this might make batch norm obsolete, which would be a huge boon to training speeds!
The experiments look convincing, and apparently it even beats BN+ReLU in accuracy... though I wish they'd shown the resulting distributions of activations after training. Assuming their fixed-point proof holds, the activations will indeed end up normalized, but it still would've been nice to see it -- maybe they ran out of space in their appendix ;)
Weirdly, the exact ELU modification they propose isn't stated explicitly in the paper!
For those wondering, it can be found in the available source code, and looks like this:
In [1]:
# An extra explanation from Reddit:
# # Q: Thanks, I will double check the analytical solution. For the numerical one, could you please
# #    explain why running the following code results in a value close to 1 rather than 0?
# du = 0.001
# u_old = np.mean(selu(np.random.normal(0, 1, 100000000)))
# u_new = np.mean(selu(np.random.normal(0 + du, 1, 100000000)))
# # print((u_new - u_old) / du)
# print(u_old, u_new)
# # A: Now I see your problem: you do not consider the effect of the weights.
# #    From one layer to the next, we have two influences:
# #    (1) multiplication with the weights and
# #    (2) applying the SELU.
# #    (1) has a centering and symmetrising effect (it draws the mean towards zero) and
# #    (2) has a variance-stabilizing effect (it draws the variance towards 1).
# #    That is why we use the variable pairs \mu & \omega and \nu & \tau to analyze both effects.
# # Q: Oh yes, that's true, zero-mean weights completely kill the mean. Thanks!
# NumPy port of the TensorFlow implementation
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * (np.exp(x) - 1))
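To make the point from that Reddit exchange concrete, here's a quick check of my own (not from the paper): SELU by itself barely moves a shifted input mean, but a single multiplication with zero-mean weights pulls it back towards zero.
In [ ]:
# My own quick check: SELU alone vs. zero-mean weights + SELU on a mean-shifted input
rng = np.random.RandomState(0)
x_shifted = rng.normal(0.5, 1.0, size=(10000, 200))      # input with mean 0.5 instead of 0
w = rng.normal(scale=np.sqrt(1/200), size=(200, 200))    # zero-mean weights, their init scheme
print(np.mean(selu(x_shifted)))        # SELU alone: the mean stays well above 0
print(np.mean(selu(x_shifted @ w)))    # weights + SELU: the mean is drawn back towards 0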
In [4]:
# """"""""""""""""""""""""""""""""""""""Dropout to a value with rescaling."""""""""""""""""""""""""""""""""""""""
# NumPy implementation (ported from the TensorFlow version; original lines kept as comments)
def dropout_selu(X, p_dropout):
    alpha = -1.7580993408473766     # = -scale * alpha of the SELU
    fixedPointMean = 0.0
    fixedPointVar = 1.0
    keep_prob = 1.0 - p_dropout
    # noise_shape = noise_shape.reshape(X)
    random_tensor = keep_prob
    # random_tensor += random_ops.random_uniform(noise_shape, seed=seed, dtype=x.dtype)
    random_tensor += np.random.uniform(size=X.shape)  # low=0, high=1
    # binary_tensor = math_ops.floor(random_tensor)
    binary_tensor = np.floor(random_tensor)           # 1 = keep, 0 = drop
    ret = X * binary_tensor + alpha * (1 - binary_tensor)
    # a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))
    a = np.sqrt(fixedPointVar / (keep_prob * ((1 - keep_prob) * ((alpha - fixedPointMean) ** 2) + fixedPointVar)))
    b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
    ret = a * ret + b
    # ret.set_shape(x.get_shape())
    ret = ret.reshape(X.shape)
    return ret
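Before plugging it into a deep net, here's a standalone sanity check of my own showing why the affine correction (a, b) is there at all: just setting dropped units to alpha' shifts the mean and changes the variance, and a, b undo exactly that.
In [ ]:
# My own check: alpha dropout without vs. with the affine correction (a, b)
x_chk = np.random.normal(size=(1000, 1000))
keep = (np.random.uniform(size=x_chk.shape) < 0.9)        # keep 90% of units
raw = x_chk * keep + (-1.7580993408473766) * (1 - keep)   # dropped units set to alpha', no correction
corrected = dropout_selu(x_chk, p_dropout=0.10)           # same idea, with the (a, b) correction
print(raw.mean(), raw.var())                # noticeably off from 0 / 1
print(corrected.mean(), corrected.var())    # back close to 0 / 1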
In [5]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x = dropout_selu(X=x, p_dropout=0.10)
    mean = x.mean(axis=1)
    scale = x.std(axis=1)  # standard deviation = sqrt(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())
In [6]:
# My NumPy implementation of standard (inverted) dropout for ReLU
def dropout_forward(X, p_dropout):
    # note: as used below, p_dropout is really the *keep* probability --
    # units survive with probability p_dropout and are scaled by 1/p_dropout
    u = np.random.binomial(1, p_dropout, size=X.shape) / p_dropout
    out = X * u
    cache = u
    return out, cache

def dropout_backward(dout, cache):
    dX = dout * cache
    return dX
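And a quick check of my own that the inverted-dropout rescaling does what it should: the expected activation is unchanged, so nothing has to be rescaled at test time.
In [ ]:
# My own check: inverted dropout leaves the expected activation unchanged
x_chk = np.ones((1000, 1000))
out_chk, _ = dropout_forward(x_chk, p_dropout=0.8)   # keeps ~80% of units, scales them by 1/0.8
print(out_chk.mean())                                # should be close to 1.0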
In [7]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x, _ = dropout_forward(p_dropout=0.8, X=x)  # i.e. keep 80% of units
    mean = x.mean(axis=1)
    scale = x.std(axis=1)  # standard deviation = sqrt(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())
In [6]:
def elu_fwd(X):
    alpha = 1.0
    scale = 1.0
    # return scale * np.where(x>=0.0, x, alpha * (np.exp(x)-1))
    X_pos = np.maximum(0.0, X)               # positive part (ReLU)
    X_neg = np.minimum(X, 0.0)               # negative part: where X <= 0
    X_neg_exp = alpha * (np.exp(X_neg) - 1)  # exponential branch for X <= 0
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X)
    return out, cache

def elu_bwd(dout, cache):
    scale, alpha, X = cache
    dout = dout * scale
    dX_neg = dout.copy()
    dX_neg[X >= 0] = 0                        # negative branch only where X < 0
    X_neg = np.minimum(X, 0)
    dX_neg = dX_neg * alpha * np.exp(X_neg)   # d/dX of alpha*(exp(X)-1) = alpha*exp(X)
    dX_pos = dout.copy()
    dX_pos[X < 0] = 0                         # positive branch has derivative 1
    dX = dX_neg + dX_pos
    return dX
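Since I derived the backward pass by hand, here's a quick finite-difference check of my own that elu_bwd matches the numerical gradient of elu_fwd:
In [ ]:
# My own gradient check: elu_bwd vs. a centered finite-difference estimate
rng = np.random.RandomState(1)
X_chk = rng.normal(size=(5, 7))
dout_chk = rng.normal(size=(5, 7))
_, cache_chk = elu_fwd(X_chk)
dX_analytic = elu_bwd(dout_chk, cache_chk)

eps = 1e-6
dX_numeric = np.zeros_like(X_chk)
for i in range(X_chk.shape[0]):
    for j in range(X_chk.shape[1]):
        Xp = X_chk.copy(); Xp[i, j] += eps
        Xm = X_chk.copy(); Xm[i, j] -= eps
        dX_numeric[i, j] = np.sum((elu_fwd(Xp)[0] - elu_fwd(Xm)[0]) * dout_chk) / (2 * eps)

print(np.max(np.abs(dX_analytic - dX_numeric)))   # should be tiny (~1e-9)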
In [7]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, _ = elu_fwd(X=x)
    x, _ = dropout_forward(p_dropout=0.95, X=x)  # i.e. keep 95% of units
    mean = x.mean(axis=1)
    scale = x.std(axis=1)  # standard deviation = sqrt(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())
In [7]:
def selu_fwd(X):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    # return scale * np.where(x>=0.0, x, alpha * (np.exp(x)-1))
    X_pos = np.maximum(0.0, X)               # positive part (ReLU)
    X_neg = np.minimum(X, 0.0)               # negative part: where X <= 0
    X_neg_exp = alpha * (np.exp(X_neg) - 1)  # exponential branch for X <= 0
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X)
    return out, cache

def selu_bwd(dout, cache):
    scale, alpha, X = cache
    dout = dout * scale
    dX_neg = dout.copy()
    dX_neg[X >= 0] = 0                        # negative branch only where X < 0
    X_neg = np.minimum(X, 0)
    dX_neg = dX_neg * alpha * np.exp(X_neg)   # d/dX of alpha*(exp(X)-1) = alpha*exp(X)
    dX_pos = dout.copy()
    dX_pos[X < 0] = 0                         # positive branch has derivative 1
    dX = dX_neg + dX_pos
    return dX

# def dropout_selu_forward(X, p_dropout):
def dropout_selu_forward(X, keep_prob):
    alpha = -1.7580993408473766     # = -scale * alpha of the SELU
    fixedPointMean = 0.0
    fixedPointVar = 1.0
    # binary keep mask -- no 1/keep_prob scaling here, the affine correction (a, b) below handles the rescaling
    u = np.random.binomial(1, keep_prob, size=X.shape)
    out = X * u + alpha * (1 - u)
    # a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))
    a = np.sqrt(fixedPointVar / (keep_prob * ((1 - keep_prob) * (alpha - fixedPointMean) ** 2 + fixedPointVar)))
    b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
    out = a * out + b
    cache = a, u
    return out, cache

def dropout_selu_backward(dout, cache):
    a, u = cache
    dout = dout * a
    dX = dout * u
    return dX
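Same kind of standalone sanity check (mine) for the ported version: a single application of dropout_selu_forward on standard-normal inputs should also leave mean and variance close to 0 and 1.
In [ ]:
# My own check: one application of dropout_selu_forward on N(0, 1) inputs
x_chk = np.random.normal(size=(1000, 1000))
out_chk, _ = dropout_selu_forward(x_chk, keep_prob=0.95)
print(out_chk.mean(), out_chk.var())   # both should stay close to 0 and 1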
In [14]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, cache = selu_fwd(x)
    x, _ = dropout_selu_forward(keep_prob=0.95, X=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1)  # standard deviation = sqrt(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())
In [18]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, cache = selu_fwd(x)
    x = dropout_selu(X=x, p_dropout=0.10)
    mean = x.mean(axis=1)
    scale = x.std(axis=1)  # standard deviation = sqrt(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())
In [19]:
alpha = 1.6732632423543772848170429916717
scale = 1.0507009873554804934193349852946
alpha_p = -scale * alpha

def alpha_dropout(h, q):
    '''h is the activation, q is the keep probability'''
    mask = np.random.binomial(1, q, size=h.shape)
    dropped = mask * h + (1 - mask) * alpha_p
    a = 1. / np.sqrt(q + alpha_p ** 2 * q * (1 - q))
    b = -a * (1 - q) * alpha_p
    return a * dropped + b

def alpha_dropout_fwd(h, q):
    '''h is the activation, q is the keep probability (q = 1 - p_dropout)'''
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    alpha_p = -scale * alpha
    mask = np.random.binomial(1, q, size=h.shape)
    dropped = mask * h + (1 - mask) * alpha_p
    a = 1. / np.sqrt(q + alpha_p ** 2 * q * (1 - q))
    b = -a * (1 - q) * alpha_p
    out = a * dropped + b
    cache = (a, mask)
    return out, cache

def alpha_dropout_bwd(dout, cache):
    a, mask = cache
    d_dropped = dout * a
    dh = d_dropped * mask
    return dh
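These look like two routes to the same thing, so here's a quick check of my own that alpha_dropout_fwd and dropout_selu_forward really apply the same transform: reseeding makes both draw the same binomial mask, and the outputs then agree.
In [ ]:
# My own check: dropout_selu_forward and alpha_dropout_fwd are the same transform
x_chk = np.random.normal(size=(100, 100))
np.random.seed(0)
out_a, _ = dropout_selu_forward(x_chk, keep_prob=0.9)
np.random.seed(0)
out_b, _ = alpha_dropout_fwd(x_chk, q=0.9)
print(np.allclose(out_a, out_b))   # True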
In [21]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
m, s = [], []
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, cache = selu_fwd(x)
    # x, _ = dropout_selu_forward(keep_prob=0.95, X=x)
    # x = alpha_dropout(h=x, q=0.50)  # q = 1 - p_dropout
    x, _ = alpha_dropout_fwd(h=x, q=0.50)  # q = 1 - p_dropout
    mean = x.mean(axis=1)
    scale = x.std(axis=1)  # standard deviation = sqrt(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())
    m.append(mean.min())
    m.append(mean.max())
    s.append(scale.min())
    s.append(scale.max())
In [22]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.plot(m, label='mean')
plt.plot(s, label='scale')
plt.legend()
plt.show()
According to this, even after 100 layers the neuron activations stay fairly close to mean 0 / variance 1 (even the most extreme per-example means and standard deviations are only off by about 0.2).
Sepp Hochreiter is amazing: LSTM, meta-learning, SNNs.
I think he has already made a much larger contribution to science than some self-proclaimed pioneers of DL who spend more time on social networks than actually doing any good research.