There are reasons why a deep neural network can work very well, yet few people get promising results simply by making their network deeper.
In this tutorial, we'd like to share some insights into the techniques that researchers have found useful for training a deep model, using MXNet and its visualization tool, TensorBoard.
Let's recap one of the relevant issues in training a deep model, the vanishing gradient: with sigmoid or tanh as the activation function, the gradient gets smaller and smaller as it is propagated backward through the layers. Just recall the formula for the parameter update and how the gradient is obtained via the chain rule, sketched below.
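As a quick reminder (a standard derivation, not taken from the original notebook), the update for the weights of layer l and the backpropagated error look like:

$$
W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad
\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^{\top}, \qquad
\delta^{(l)} = \Big(\big(W^{(l+1)}\big)^{\top} \delta^{(l+1)}\Big) \odot \sigma'\big(z^{(l)}\big)
$$

For the sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le 1/4$, so each extra layer multiplies the backpropagated error by at most 1/4 times the weight magnitude; with small initial weights, the gradient reaching the first layers shrinks roughly geometrically with depth. ReLU, by contrast, has derivative 1 on its active units and passes the gradient through unscaled.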
Here we create a simple MLP and visualize the learning process through loss/accuracy and the gradient distributions, while changing the initialization and activation settings.
We adopt an MLP as our model and run our experiments on the MNIST dataset. Then we'll visualize the weights and gradients of a layer using Monitor in MXNet and histograms in TensorBoard.
Here's the network structure:
def get_mlp(acti="relu"):
    """
    multi-layer perceptron
    """
    data = mx.symbol.Variable('data')
    fc = mx.symbol.FullyConnected(data=data, name='fc', num_hidden=512)
    act = mx.symbol.Activation(data=fc, name='act', act_type=acti)
    fc0 = mx.symbol.FullyConnected(data=act, name='fc0', num_hidden=256)
    act0 = mx.symbol.Activation(data=fc0, name='act0', act_type=acti)
    fc1 = mx.symbol.FullyConnected(data=act0, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data=fc1, name='act1', act_type=acti)
    fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64)
    act2 = mx.symbol.Activation(data=fc2, name='act2', act_type=acti)
    fc3 = mx.symbol.FullyConnected(data=act2, name='fc3', num_hidden=32)
    act3 = mx.symbol.Activation(data=fc3, name='act3', act_type=acti)
    fc4 = mx.symbol.FullyConnected(data=act3, name='fc4', num_hidden=16)
    act4 = mx.symbol.Activation(data=fc4, name='act4', act_type=acti)
    fc5 = mx.symbol.FullyConnected(data=act4, name='fc5', num_hidden=10)
    mlp = mx.symbol.SoftmaxOutput(data=fc5, name='softmax')
    return mlp
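For reference, here is a minimal sketch of how the resulting symbol can be inspected (the batch shape (128, 784), i.e. flattened 28x28 MNIST images, is just an illustrative choice):

import mxnet as mx

net = get_mlp("relu")
# Each FullyConnected layer contributes a weight and a bias to the argument list.
print(net.list_arguments())
# Print a per-layer summary of output shapes and parameter counts for the given input shape.
mx.viz.print_summary(net, shape={'data': (128, 784)})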
As you might already have noticed, we intentionally add more layers than usual, since the vanishing gradient problem becomes more severe as the network gets deeper.
For the weight initialization we compare two schemes, uniform and Xavier:
if args.init == 'uniform':
    init = mx.init.Uniform(0.1)
elif args.init == 'xavier':
    init = mx.init.Xavier(factor_type="in", magnitude=2.34)
Note that we intentionally choose a near-zero range for the uniform initialization.
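For intuition, Uniform(0.1) draws every weight from [-0.1, 0.1] regardless of layer width, while Xavier scales the range to each layer's fan-in. The sketch below reflects our understanding of mx.init.Xavier with rnd_type="uniform" and factor_type="in" (scale = sqrt(magnitude / fan_in)), so treat the formula as an assumption rather than a specification:

import numpy as np

def xavier_uniform_scale(fan_in, magnitude=2.34):
    # assumed rule: weights ~ Uniform(-scale, scale) with scale = sqrt(magnitude / fan_in)
    return np.sqrt(magnitude / fan_in)

# first hidden layer (784 -> 512): a much narrower range than Uniform(0.1)'s fixed 0.1
print(xavier_uniform_scale(784))   # ~0.055
# last hidden layer (16 -> 10): a small fan-in gives a wider range
print(xavier_uniform_scale(16))    # ~0.38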
We also compare two different activation functions, sigmoid and ReLU:
# acti = "sigmoid" or "relu"
act = mx.symbol.Activation(data=fc, name='act', act_type=acti)
To monitor the weights and gradients of this network under the different settings, we can use MXNet's Monitor for logging and TensorBoard for visualization.
Here's a code snippet from train_model.py:
import mxnet as mx
import numpy as np
from tensorboard import summary
from tensorboard import FileWriter

# where to keep your TensorBoard logging files
logdir = './logs/'
summary_writer = FileWriter(logdir)

# mx.mon.Monitor's callback
def get_gradient(g):
    # flatten the gradient array
    grad = g.asnumpy().flatten()
    # log it to TensorBoard as a histogram
    s = summary.histogram('fc_backward_weight', grad)
    summary_writer.add_summary(s)
    # return the normalized gradient norm, which Monitor prints to the console
    return mx.nd.norm(g) / np.sqrt(g.size)

# get the gradient passed to the first fully-connected layer
mon = mx.mon.Monitor(int(args.num_examples / args.batch_size), get_gradient,
                     pattern='fc_backward_weight')

# training
model.fit(
    X=train,
    eval_data=val,
    eval_metric=eval_metrics,
    kvstore=kv,
    monitor=mon,
    epoch_end_callback=checkpoint)

# close summary_writer
summary_writer.close()
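After (or during) training, you can launch TensorBoard against the same directory, assuming it is installed and on your PATH, and open the Histograms tab to watch how the fc_backward_weight distribution evolves across batches:

tensorboard --logdir=./logs/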
In [3]:
import sys
sys.path.append('./mnist/')
from train_mnist import *
Training this MLP with uniform initialization and sigmoid activation produces gradients like the following: the fc_backward_weight values are very close to zero and barely change across batches.
2017-01-07 15:44:38,845 Node[0] Batch: 1 fc_backward_weight 5.1907e-07
2017-01-07 15:44:38,846 Node[0] Batch: 1 fc_backward_weight 4.2085e-07
2017-01-07 15:44:38,847 Node[0] Batch: 1 fc_backward_weight 4.31894e-07
2017-01-07 15:44:38,848 Node[0] Batch: 1 fc_backward_weight 5.80652e-07
2017-01-07 15:45:50,199 Node[0] Batch: 4213 fc_backward_weight 5.49988e-07
2017-01-07 15:45:50,200 Node[0] Batch: 4213 fc_backward_weight 5.89305e-07
2017-01-07 15:45:50,201 Node[0] Batch: 4213 fc_backward_weight 3.71941e-07
2017-01-07 15:45:50,202 Node[0] Batch: 4213 fc_backward_weight 8.05085e-07
You might wonder why there are four different fc_backward_weight values per batch: we train on four devices (CPUs), so the Monitor reports the gradient statistic from each of them.
In [4]:
# uniform init and relu activation
args = parse_args('uniform', 'uniform_relu')
data_shape = (784, )
net = get_mlp("relu")
# train
train_model.fit(args, net, get_iterator(data_shape))
Even with a "poor" initialization, the model can still converge quickly given a proper activation function, and the gradient magnitudes are orders of magnitude larger than before:
2017-01-07 15:54:12,286 Node[0] Batch: 1 fc_backward_weight 0.000267409
2017-01-07 15:54:12,287 Node[0] Batch: 1 fc_backward_weight 0.00031988
2017-01-07 15:54:12,288 Node[0] Batch: 1 fc_backward_weight 0.000306785
2017-01-07 15:54:12,289 Node[0] Batch: 1 fc_backward_weight 0.000347533
2017-01-07 15:55:25,936 Node[0] Batch: 4213 fc_backward_weight 0.0226081
2017-01-07 15:55:25,937 Node[0] Batch: 4213 fc_backward_weight 0.0039793
2017-01-07 15:55:25,937 Node[0] Batch: 4213 fc_backward_weight 0.0306151
2017-01-07 15:55:25,938 Node[0] Batch: 4213 fc_backward_weight 0.00818676
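To make the contrast concrete, here is a small illustrative check (not part of the original notebook) of how much the activation derivative alone scales the backpropagated gradient at each layer:

import numpy as np

z = np.linspace(-3, 3, 7)
sig = 1.0 / (1.0 + np.exp(-z))
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 and decays quickly away from 0
print(np.round(sig * (1 - sig), 3))
# ReLU's derivative is exactly 1 on active units, so it does not shrink the gradient
print((z > 0).astype(float))
# upper bound on the sigmoid attenuation across the 6 hidden layers: 0.25**6 ~ 2.4e-4
print(0.25 ** 6)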
In [4]:
# Xavier and sigmoid
args = parse_args('xavier', 'xavier_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")
# train
train_model.fit(args, net, get_iterator(data_shape))
You might find these materials useful:
[1] Rohan #4: The vanishing gradient problem – A Year of Artificial Intelligence
[2] On the difficulty of training recurrent and deep neural networks - YouTube
[3] What is the vanishing gradient problem? - Quora