There are reasons why deep neural networks can work very well, yet few people get promising results simply by making their networks deeper.
In this tutorial, we would like to share some insights into the techniques that researchers have found useful for training a deep model, using MXNet and its visualization companion, TensorBoard.
Let’s recap some of the relevant issues on training a deep model:
With sigmoid or tanh as the activation function, the gradient becomes smaller and smaller as it is propagated backwards through the layers, which is the well-known vanishing gradient problem. Just recall the formulas for the parameter updates and the gradient.
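As a rough reminder (the notation here is illustrative, not taken from the code): gradient descent updates each weight as

$$ w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, $$

and by the chain rule the gradient that reaches an early layer is a product of per-layer factors, each containing the derivative of the activation, roughly

$$ \frac{\partial L}{\partial w^{(1)}} \;\propto\; \prod_{k} \sigma'\!\left(z^{(k)}\right) w^{(k)}. $$

Since the sigmoid derivative satisfies $\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \le 0.25$, this product shrinks roughly exponentially with depth, so the earliest layers receive almost no gradient signal.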
In [1]:
import os

def download(data_dir):
    """Download and unpack the MNIST dataset into data_dir if it is not already there."""
    if not os.path.isdir(data_dir):
        os.system('mkdir ' + data_dir)
    os.chdir(data_dir)
    if (not os.path.exists('train-images-idx3-ubyte')) or \
       (not os.path.exists('train-labels-idx1-ubyte')) or \
       (not os.path.exists('t10k-images-idx3-ubyte')) or \
       (not os.path.exists('t10k-labels-idx1-ubyte')):
        os.system('wget http://data.mxnet.io/mxnet/data/mnist.zip')
        os.system('unzip mnist.zip; rm mnist.zip')
    os.chdir('..')
In [2]:
def get_iterator(data_shape):
    def get_iterator_impl(args, kv):
        data_dir = args.data_dir
        # if Windows
        if os.name == "nt":
            data_dir = data_dir[:-1] + "\\"
        if '://' not in args.data_dir:
            download(data_dir)
        flat = False if len(data_shape) == 3 else True

        train = mx.io.MNISTIter(
            image       = data_dir + "train-images-idx3-ubyte",
            label       = data_dir + "train-labels-idx1-ubyte",
            input_shape = data_shape,
            batch_size  = args.batch_size,
            shuffle     = True,
            flat        = flat,
            num_parts   = kv.num_workers,
            part_index  = kv.rank)

        val = mx.io.MNISTIter(
            image       = data_dir + "t10k-images-idx3-ubyte",
            label       = data_dir + "t10k-labels-idx1-ubyte",
            input_shape = data_shape,
            batch_size  = args.batch_size,
            flat        = flat,
            num_parts   = kv.num_workers,
            part_index  = kv.rank)

        return (train, val)
    return get_iterator_impl
In [3]:
def get_mlp(acti="relu"):
    """
    multi-layer perceptron
    """
    data = mx.symbol.Variable('data')
    fc   = mx.symbol.FullyConnected(data=data, name='fc',  num_hidden=512)
    act  = mx.symbol.Activation(data=fc,  name='act',  act_type=acti)
    fc0  = mx.symbol.FullyConnected(data=act,  name='fc0', num_hidden=256)
    act0 = mx.symbol.Activation(data=fc0, name='act0', act_type=acti)
    fc1  = mx.symbol.FullyConnected(data=act0, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data=fc1, name='act1', act_type=acti)
    fc2  = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64)
    act2 = mx.symbol.Activation(data=fc2, name='act2', act_type=acti)
    fc3  = mx.symbol.FullyConnected(data=act2, name='fc3', num_hidden=32)
    act3 = mx.symbol.Activation(data=fc3, name='act3', act_type=acti)
    fc4  = mx.symbol.FullyConnected(data=act3, name='fc4', num_hidden=16)
    act4 = mx.symbol.Activation(data=fc4, name='act4', act_type=acti)
    fc5  = mx.symbol.FullyConnected(data=act4, name='fc5', num_hidden=10)
    mlp  = mx.symbol.SoftmaxOutput(data=fc5, name='softmax')
    return mlp
As you may have noticed, we intentionally add more layers than usual, because the vanishing gradient problem becomes more severe as the network gets deeper.
We adopt this MLP as our model and run the experiments on the MNIST dataset, visualizing the learning process through its loss/accuracy curves and its gradient distributions under different initialization and activation settings. We will record the weights and gradients of a layer with Monitor in MXNet and visualize them with Histogram in TensorBoard.
For weight initialization we compare two schemes, uniform and Xavier:
if args.init == 'uniform':
    init = mx.init.Uniform(0.1)
if args.init == 'xavier':
    init = mx.init.Xavier(factor_type="in", magnitude=2.34)
Note that we intentionally choose a small, near-zero scale for the uniform initialization.
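For intuition, here is a small illustrative sketch (plain numpy, not part of the training code; it assumes MXNet's Xavier initializer with factor_type="in" and the default uniform rnd_type samples from roughly [-sqrt(magnitude/fan_in), +sqrt(magnitude/fan_in)]) showing how Xavier adapts its scale to each layer's fan-in, while Uniform(0.1) uses the same fixed range everywhere:

import numpy as np

# fan-in of each fully-connected layer in the MLP above
fan_ins = [784, 512, 256, 128, 64, 32, 16]
magnitude = 2.34

for fan_in in fan_ins:
    # approximate sampling range used by Xavier(factor_type="in") for this layer
    xavier_scale = np.sqrt(magnitude / fan_in)
    print('fan_in=%4d  uniform range: +/-0.100  xavier range: +/-%.3f'
          % (fan_in, xavier_scale))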
We also compare two different activations, sigmoid and relu:
# acti = sigmoid or relu.
act = mx.symbol.Activation(data = fc, name='act', act_type=acti)
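To see why this choice matters so much for deep networks, here is a small illustrative sketch (plain numpy, not part of the tutorial's training code) comparing the derivatives of the two activations, which get multiplied into the gradient at every layer:

import numpy as np

x = np.linspace(-5, 5, 101)

# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) peaks at 0.25 and decays quickly
sig = 1.0 / (1.0 + np.exp(-x))
sig_grad = sig * (1.0 - sig)

# relu'(x) is exactly 1 for every positive input, so it does not shrink the gradient
relu_grad = (x > 0).astype(np.float64)

print('max sigmoid gradient: %.4f' % sig_grad.max())   # 0.25
print('max relu gradient:    %.4f' % relu_grad.max())  # 1.0
# rough upper bound from the activation derivatives alone, after the 6 sigmoid layers above
print('0.25 ** 6 = %.6f' % 0.25 ** 6)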
In order to monitor the weights and gradients of this network under the different settings, we can use MXNet's Monitor for logging and TensorBoard for visualization.
Here's a code snippet from train_model.py:
import mxnet as mx
import numpy as np
from tensorboard import summary
from tensorboard import FileWriter

# where to keep your TensorBoard logging file
logdir = './logs/'
summary_writer = FileWriter(logdir)

# mx.mon.Monitor's callback
def get_gradient(g):
    # get a flattened array of the gradient values
    grad = g.asnumpy().flatten()
    # log to TensorBoard as a histogram
    s = summary.histogram('fc_backward_weight', grad)
    summary_writer.add_summary(s)
    return mx.nd.norm(g) / np.sqrt(g.size)

# monitor the gradient passed to the first fully-connected layer
mon = mx.mon.Monitor(int(args.num_examples / args.batch_size), get_gradient,
                     pattern='fc_backward_weight')

# training
model.fit(
    X                  = train,
    eval_data          = val,
    eval_metric        = eval_metrics,
    kvstore            = kv,
    monitor            = mon,
    epoch_end_callback = checkpoint)

# close summary_writer
summary_writer.close()
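Once some summaries have been written, you can typically inspect them by pointing TensorBoard at the logging directory, e.g. running `tensorboard --logdir=./logs/` and opening the histogram view in your browser.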
In [4]:
import mxnet as mx
import argparse
import os, sys

def parse_args(init_type, name):
    parser = argparse.ArgumentParser(description='train an image classifier on mnist')
    parser.add_argument('--network', type=str, default='mlp',
                        choices=['mlp', 'lenet', 'lenet-stn'],
                        help='the cnn to use')
    parser.add_argument('--data-dir', type=str, default='mnist/',
                        help='the input data directory')
    parser.add_argument('--gpus', type=str,
                        help='the gpus will be used, e.g "0,1,2,3"')
    parser.add_argument('--num-examples', type=int, default=60000,
                        help='the number of training examples')
    parser.add_argument('--batch-size', type=int, default=128,
                        help='the batch size')
    parser.add_argument('--lr', type=float, default=.1,
                        help='the initial learning rate')
    parser.add_argument('--model-prefix', type=str,
                        help='the prefix of the model to load/save')
    parser.add_argument('--save-model-prefix', type=str,
                        help='the prefix of the model to save')
    parser.add_argument('--num-epochs', type=int, default=10,
                        help='the number of training epochs')
    parser.add_argument('--load-epoch', type=int,
                        help="load the model on an epoch using the model-prefix")
    parser.add_argument('--kv-store', type=str, default='local',
                        help='the kvstore type')
    parser.add_argument('--lr-factor', type=float, default=1,
                        help='times the lr with a factor for every lr-factor-epoch epoch')
    parser.add_argument('--lr-factor-epoch', type=float, default=1,
                        help='the number of epoch to factor the lr, could be .5')
    parser.add_argument('--init', type=str, default=init_type,
                        help='the weight initialization method')
    parser.add_argument('--name', type=str, default=name,
                        help='name for summary.histogram for gradient/weight logging')
    return parser.parse_args("")
In [5]:
import mxnet as mx
import logging
import os
import numpy as np
from tensorboard import summary
from tensorboard import FileWriter

def fit(args, network, data_loader, batch_end_callback=None):
    # kvstore
    kv = mx.kvstore.create(args.kv_store)

    # logging
    head = '%(asctime)-15s Node[' + str(kv.rank) + '] %(message)s'
    if 'log_file' in args and args.log_file is not None:
        log_file = args.log_file
        log_dir = args.log_dir
        log_file_full_name = os.path.join(log_dir, log_file)
        if not os.path.exists(log_dir):
            os.mkdir(log_dir)
        logger = logging.getLogger()
        handler = logging.FileHandler(log_file_full_name)
        formatter = logging.Formatter(head)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
        logger.info('start with arguments %s', args)
    else:
        logging.basicConfig(level=logging.DEBUG, format=head)
        logging.info('start with arguments %s', args)

    # load model
    model_prefix = args.model_prefix
    if model_prefix is not None:
        model_prefix += "-%d" % (kv.rank)
    model_args = {}
    if args.load_epoch is not None:
        assert model_prefix is not None
        tmp = mx.model.FeedForward.load(model_prefix, args.load_epoch)
        model_args = {'arg_params' : tmp.arg_params,
                      'aux_params' : tmp.aux_params,
                      'begin_epoch' : args.load_epoch}
        # TODO: check epoch_size for 'dist_sync'
        epoch_size = args.num_examples / args.batch_size
        model_args['begin_num_update'] = epoch_size * args.load_epoch

    # save model
    save_model_prefix = args.save_model_prefix
    if save_model_prefix is None:
        save_model_prefix = model_prefix
    checkpoint = None if save_model_prefix is None else mx.callback.do_checkpoint(save_model_prefix)

    # data
    (train, val) = data_loader(args, kv)

    # train
    devs = [mx.cpu(i) for i in range(4)] if args.gpus is None else [
        mx.gpu(int(i)) for i in args.gpus.split(',')]

    epoch_size = args.num_examples / args.batch_size
    if args.kv_store == 'dist_sync':
        epoch_size /= kv.num_workers
        model_args['epoch_size'] = epoch_size

    if 'lr_factor' in args and args.lr_factor < 1:
        model_args['lr_scheduler'] = mx.lr_scheduler.FactorScheduler(
            step = max(int(epoch_size * args.lr_factor_epoch), 1),
            factor = args.lr_factor)

    if 'clip_gradient' in args and args.clip_gradient is not None:
        model_args['clip_gradient'] = args.clip_gradient

    # disable kvstore for single device
    if 'local' in kv.type and (
            args.gpus is None or len(args.gpus.split(',')) == 1):
        kv = None

    if args.init == 'uniform':
        init = mx.init.Uniform(0.1)
    if args.init == 'normal':
        init = mx.init.Normal(0.1)  # Normal takes a single sigma argument
    if args.init == 'xavier':
        init = mx.init.Xavier(factor_type="in", magnitude=2.34)

    model = mx.model.FeedForward(
        ctx           = devs,
        symbol        = network,
        num_epoch     = args.num_epochs,
        learning_rate = args.lr,
        momentum      = 0.9,
        wd            = 0.00001,
        initializer   = init,
        **model_args)

    eval_metrics = ['accuracy']
    ## TopKAccuracy only allows top_k > 1
    for top_k in [5]:
        eval_metrics.append(mx.metric.create('top_k_accuracy', top_k=top_k))

    if batch_end_callback is not None:
        if not isinstance(batch_end_callback, list):
            batch_end_callback = [batch_end_callback]
    else:
        batch_end_callback = []
    batch_end_callback.append(mx.callback.Speedometer(args.batch_size, 50))

    logdir = './logs/'
    summary_writer = FileWriter(logdir)

    def get_grad(g):
        # logging using tensorboard
        grad = g.asnumpy().flatten()
        s = summary.histogram(args.name, grad)
        summary_writer.add_summary(s)
        return mx.nd.norm(g) / np.sqrt(g.size)

    # monitor the gradient passed to the first fully-connected layer
    mon = mx.mon.Monitor(int(args.num_examples / args.batch_size), get_grad,
                         pattern='fc_backward_weight')

    model.fit(
        X                  = train,
        eval_data          = val,
        eval_metric        = eval_metrics,
        kvstore            = kv,
        monitor            = mon,
        epoch_end_callback = checkpoint)

    summary_writer.close()
In [6]:
# Uniform and sigmoid
args = parse_args('uniform', 'uniform_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")
# train
fit(args, net, get_iterator(data_shape))
As you can see, the values of fc_backward_weight stay very close to zero, and they barely change from batch to batch.
2017-01-07 15:44:38,845 Node[0] Batch: 1 fc_backward_weight 5.1907e-07
2017-01-07 15:44:38,846 Node[0] Batch: 1 fc_backward_weight 4.2085e-07
2017-01-07 15:44:38,847 Node[0] Batch: 1 fc_backward_weight 4.31894e-07
2017-01-07 15:44:38,848 Node[0] Batch: 1 fc_backward_weight 5.80652e-07
2017-01-07 15:45:50,199 Node[0] Batch: 4213 fc_backward_weight 5.49988e-07
2017-01-07 15:45:50,200 Node[0] Batch: 4213 fc_backward_weight 5.89305e-07
2017-01-07 15:45:50,201 Node[0] Batch: 4213 fc_backward_weight 3.71941e-07
2017-01-07 15:45:50,202 Node[0] Batch: 4213 fc_backward_weight 8.05085e-07
You might wonder why there are four different fc_backward_weight values per batch: it is because training runs on four CPU devices.
In [7]:
# Uniform and relu
args = parse_args('uniform', 'uniform_relu')
data_shape = (784, )
net = get_mlp("relu")
# train
fit(args, net, get_iterator(data_shape))
Even with a "poor" initialization, the model can still converge quickly when paired with a proper activation function, and the gradient magnitudes are several orders of magnitude larger than in the sigmoid case.
2017-01-07 15:54:12,286 Node[0] Batch: 1 fc_backward_weight 0.000267409
2017-01-07 15:54:12,287 Node[0] Batch: 1 fc_backward_weight 0.00031988
2017-01-07 15:54:12,288 Node[0] Batch: 1 fc_backward_weight 0.000306785
2017-01-07 15:54:12,289 Node[0] Batch: 1 fc_backward_weight 0.000347533
2017-01-07 15:55:25,936 Node[0] Batch: 4213 fc_backward_weight 0.0226081
2017-01-07 15:55:25,937 Node[0] Batch: 4213 fc_backward_weight 0.0039793
2017-01-07 15:55:25,937 Node[0] Batch: 4213 fc_backward_weight 0.0306151
2017-01-07 15:55:25,938 Node[0] Batch: 4213 fc_backward_weight 0.00818676
In [8]:
# Xavier and sigmoid
args = parse_args('xavier', 'xavier_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")
# train
fit(args, net, get_iterator(data_shape))
You might find these materials useful:
[1] Rohan #4: The vanishing gradient problem – A Year of Artificial Intelligence
[2] On the difficulty of training recurrent and deep neural networks - YouTube
[3] What is the vanishing gradient problem? - Quora