There are reasons why a deep neural network can work very well, yet few people get promising results simply by making their network deeper.
In this tutorial, we'd like to share some insights into the techniques that researchers have found useful for training a deep model, using MXNet and its visualization tool, TensorBoard.
Let's recap one of the relevant issues in training a deep model, the vanishing gradient: with sigmoid or tanh as the activation function, the gradient gets smaller and smaller as it is propagated backward through the layers. Just recall the formula for the parameter update and how the gradient is obtained via the chain rule, sketched below.
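As a quick reminder (a standard derivation, not taken from the original notebook), the update for the weights of layer l and the backpropagated error look like:

$$
W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad
\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^{\top}, \qquad
\delta^{(l)} = \Big(\big(W^{(l+1)}\big)^{\top} \delta^{(l+1)}\Big) \odot \sigma'\big(z^{(l)}\big)
$$

For the sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le 1/4$, so each extra layer multiplies the backpropagated error by at most 1/4 times the weight magnitude; with small initial weights, the gradient reaching the first layers shrinks roughly geometrically with depth. ReLU, by contrast, has derivative 1 on its active units and passes the gradient through unscaled.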
Here we create a simple MLP and visualize the learning process through loss/accuracy and the gradient distributions, while changing the initialization and activation settings.
We adopt an MLP as our model and run our experiments on the MNIST dataset. Then we'll visualize the weights and gradients of a layer using Monitor in MXNet and histograms in TensorBoard.
Here's the network structure:
def get_mlp(acti="relu"):
    """
    multi-layer perceptron
    """
    data = mx.symbol.Variable('data')
    fc = mx.symbol.FullyConnected(data=data, name='fc', num_hidden=512)
    act = mx.symbol.Activation(data=fc, name='act', act_type=acti)
    fc0 = mx.symbol.FullyConnected(data=act, name='fc0', num_hidden=256)
    act0 = mx.symbol.Activation(data=fc0, name='act0', act_type=acti)
    fc1 = mx.symbol.FullyConnected(data=act0, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data=fc1, name='act1', act_type=acti)
    fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64)
    act2 = mx.symbol.Activation(data=fc2, name='act2', act_type=acti)
    fc3 = mx.symbol.FullyConnected(data=act2, name='fc3', num_hidden=32)
    act3 = mx.symbol.Activation(data=fc3, name='act3', act_type=acti)
    fc4 = mx.symbol.FullyConnected(data=act3, name='fc4', num_hidden=16)
    act4 = mx.symbol.Activation(data=fc4, name='act4', act_type=acti)
    fc5 = mx.symbol.FullyConnected(data=act4, name='fc5', num_hidden=10)
    mlp = mx.symbol.SoftmaxOutput(data=fc5, name='softmax')
    return mlp
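For reference, here is a minimal sketch of how the resulting symbol can be inspected (the batch shape (128, 784), i.e. flattened 28x28 MNIST images, is just an illustrative choice):

import mxnet as mx

net = get_mlp("relu")
# Each FullyConnected layer contributes a weight and a bias to the argument list.
print(net.list_arguments())
# Print a per-layer summary of output shapes and parameter counts for the given input shape.
mx.viz.print_summary(net, shape={'data': (128, 784)})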
As you might already have noticed, we intentionally add more layers than usual, since the vanishing gradient problem becomes more severe as the network gets deeper.
For the weight initialization we compare two schemes, uniform and Xavier:
if args.init == 'uniform':
    init = mx.init.Uniform(0.1)
elif args.init == 'xavier':
    init = mx.init.Xavier(factor_type="in", magnitude=2.34)
Note that we intentionally choose a near-zero range for the uniform initialization.
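For intuition, Uniform(0.1) draws every weight from [-0.1, 0.1] regardless of layer width, while Xavier scales the range to each layer's fan-in. The sketch below reflects our understanding of mx.init.Xavier with rnd_type="uniform" and factor_type="in" (scale = sqrt(magnitude / fan_in)), so treat the formula as an assumption rather than a specification:

import numpy as np

def xavier_uniform_scale(fan_in, magnitude=2.34):
    # assumed rule: weights ~ Uniform(-scale, scale) with scale = sqrt(magnitude / fan_in)
    return np.sqrt(magnitude / fan_in)

# first hidden layer (784 -> 512): a much narrower range than Uniform(0.1)'s fixed 0.1
print(xavier_uniform_scale(784))   # ~0.055
# last hidden layer (16 -> 10): a small fan-in gives a wider range
print(xavier_uniform_scale(16))    # ~0.38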
We also compare two different activation functions, sigmoid and ReLU:
# acti = "sigmoid" or "relu"
act = mx.symbol.Activation(data=fc, name='act', act_type=acti)
To monitor the weights and gradients of this network under the different settings, we can use MXNet's Monitor for logging and TensorBoard for visualization.
Here's a code snippet from train_model.py:
import mxnet as mx
import numpy as np
from tensorboard import summary
from tensorboard import FileWriter

# where to keep your TensorBoard logging files
logdir = './logs/'
summary_writer = FileWriter(logdir)

# mx.mon.Monitor's callback
def get_gradient(g):
    # flatten the gradient array
    grad = g.asnumpy().flatten()
    # log it to TensorBoard as a histogram
    s = summary.histogram('fc_backward_weight', grad)
    summary_writer.add_summary(s)
    # return the normalized gradient norm, which Monitor prints to the console
    return mx.nd.norm(g) / np.sqrt(g.size)

# get the gradient passed to the first fully-connected layer
mon = mx.mon.Monitor(int(args.num_examples / args.batch_size), get_gradient,
                     pattern='fc_backward_weight')

# training
model.fit(
    X=train,
    eval_data=val,
    eval_metric=eval_metrics,
    kvstore=kv,
    monitor=mon,
    epoch_end_callback=checkpoint)

# close summary_writer
summary_writer.close()
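After (or during) training, you can launch TensorBoard against the same directory, assuming it is installed and on your PATH, and open the Histograms tab to watch how the fc_backward_weight distribution evolves across batches:

tensorboard --logdir=./logs/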
In [3]:
import sys
sys.path.append('./mnist/')
from train_mnist import *
Training this MLP with uniform initialization and sigmoid activation produces gradients like the following: the fc_backward_weight values are very close to zero and barely change across batches.
2017-01-07 15:44:38,845 Node[0] Batch: 1 fc_backward_weight 5.1907e-07
2017-01-07 15:44:38,846 Node[0] Batch: 1 fc_backward_weight 4.2085e-07
2017-01-07 15:44:38,847 Node[0] Batch: 1 fc_backward_weight 4.31894e-07
2017-01-07 15:44:38,848 Node[0] Batch: 1 fc_backward_weight 5.80652e-07
2017-01-07 15:45:50,199 Node[0] Batch: 4213 fc_backward_weight 5.49988e-07
2017-01-07 15:45:50,200 Node[0] Batch: 4213 fc_backward_weight 5.89305e-07
2017-01-07 15:45:50,201 Node[0] Batch: 4213 fc_backward_weight 3.71941e-07
2017-01-07 15:45:50,202 Node[0] Batch: 4213 fc_backward_weight 8.05085e-07
You might wonder why there are four different fc_backward_weight values per batch: we train on four devices (CPUs), so the Monitor reports the gradient statistic from each of them.
In [4]:
# uniform init and relu activation
args = parse_args('uniform', 'uniform_relu')
data_shape = (784, )
net = get_mlp("relu")
# train
train_model.fit(args, net, get_iterator(data_shape))
Even with a "poor" initialization, the model can still converge quickly given a proper activation function, and the gradient magnitudes are orders of magnitude larger than before:
2017-01-07 15:54:12,286 Node[0] Batch: 1 fc_backward_weight 0.000267409
2017-01-07 15:54:12,287 Node[0] Batch: 1 fc_backward_weight 0.00031988
2017-01-07 15:54:12,288 Node[0] Batch: 1 fc_backward_weight 0.000306785
2017-01-07 15:54:12,289 Node[0] Batch: 1 fc_backward_weight 0.000347533
2017-01-07 15:55:25,936 Node[0] Batch: 4213 fc_backward_weight 0.0226081
2017-01-07 15:55:25,937 Node[0] Batch: 4213 fc_backward_weight 0.0039793
2017-01-07 15:55:25,937 Node[0] Batch: 4213 fc_backward_weight 0.0306151
2017-01-07 15:55:25,938 Node[0] Batch: 4213 fc_backward_weight 0.00818676
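To make the contrast concrete, here is a small illustrative check (not part of the original notebook) of how much the activation derivative alone scales the backpropagated gradient at each layer:

import numpy as np

z = np.linspace(-3, 3, 7)
sig = 1.0 / (1.0 + np.exp(-z))
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 and decays quickly away from 0
print(np.round(sig * (1 - sig), 3))
# ReLU's derivative is exactly 1 on active units, so it does not shrink the gradient
print((z > 0).astype(float))
# upper bound on the sigmoid attenuation across the 6 hidden layers: 0.25**6 ~ 2.4e-4
print(0.25 ** 6)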
In [4]:
# Xavier and sigmoid
args = parse_args('xavier', 'xavier_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")
# train
train_model.fit(args, net, get_iterator(data_shape))
You might find these materials useful:
[1] Rohan #4: The vanishing gradient problem – A Year of Artificial Intelligence
[2] On the difficulty of training recurrent and deep neural networks - YouTube
[3] What is the vanishing gradient problem? - Quora